Research in language and literature: Old problems, new solutions


Author: Luis Guerra

Universidad Europea de Madrid

Keywords: Language Industry, linguistics, literature, humanities, research tools, information technology, textual data base, search programme, teaching and learning, communication.

Article style and source: Paper presented at the Conference in Bergen, The Future of Humanities in the Digital Age, September 25-28, 1998.



Abstract

The principal ideas of this article are the following:
  1. Language Industry is a real humanistic discipline, which links technological, scientific and humanistic knowledge. Language Industry considers knowledge as a whole, derived from human reasoning.
  2. The programmes that check the correctness and the style of our writing are implicitly or explicitly based on the stylistics of choice, which allows us to quantify the facts of style.
  3. The modern editions of annotated texts offer new possibilities. The printed text is accompanied by an electronic text that combines a textual data base with a search programme and lets the reader analyse the text in different ways.
  4. The most recent linguistic corpora are powerful tools that simplify descriptive linguistics. They allow the researcher's judgements and assumptions to be founded on empirical data.
This paper develops some experiences and ideas derived from my classes at the Universidad Europea de Madrid, where I teach Language Industry, a subject taught in the third year of the Translation Studies degree.

Humanities and Language Industry

When the term "Language Industry" (Industrias de la lengua, Les industries de la langue) is used, it can be applied to different activities which should be defined.

On the one hand, Language Industry refers to the industry that supports communication; in this sense, printing is undoubtedly the first major step in industrialisation. The typewriter, the fax, e-mail and the word processor are tools that appear in this process of industrialisation.

Language Industry also deals with the activities (products and processes) that manipulate language. It is an industry primarily linked to computers. According to Sager (1994), if the linguistic point of view is stressed, the field involved is Computational Linguistics. However, if the applications of Computer Technology are the main objective, the result is Natural Language Processing.

On the other hand, Language Industry is an industry which produces language as well: the generation of automatic abstracts, automatic translation, the production of languages in the form of indexing and documentation languages, etc.

These different perspectives coincide in reconciling distant and even opposite subjects. So, Linguistics and Literature, which are the basis of Humanities as they are generally known, become closely linked to Telecommunication Engineering and Computers.

I am interested in the rapprochement that Language Industry brings about, because it connects with our scientific tradition. The unity of thought and the close link among different branches of knowledge were upheld during the Enlightenment. For instance, Gaspar Melchor de Jovellanos (1744-1810), one of the most representative figures of Enlightenment Spain, founded the Royal Asturian Institute, a technical institute that trained naval and mining technicians. Jovellanos's most original idea was to consider the humanities the main subject for its students. So, in his Oration on the Necessary Unification of Literary Study and the Sciences, he writes:

"This meeting so long desired and never well established in our imperfect educational methods may seem odd to some, impossible to others and perhaps to you [students of the Royal Asturian Institute] useless and unprofitable" (Jovellanos, 1978/3: 207).

In my opinion, these comments remain valid today. Jovellanos believed that the scientific disciplines arrive at and hold ideas, while Linguistics and Literature describe them. Knowledge of the world around us is achieved through science; Linguistics and Literature allow us to communicate and spread it, giving it new forms in order to grasp all its nuances.

Jovellanos also writes in the Oration about

"the intimate union which all human knowledge has, whose intuition and comprehension must be the unique objective of our study, because without it all knowledge is in vain" (Jovellanos, 1978/3: 210).

From my point of view, one of the most interesting ideas that Language Industry offers to scholars is that of reuniting the different branches of specific and specialized knowledge without forgetting its core. This reconnection of various fields of study, which Language Industry is able to show, leads me to redefine the term "Humanities". The redefinition is closer to the original meaning of the term and, moreover, is linked to the purpose of this congress. For both of these reasons, I would like to explain it now.

F. Savater (1997) thinks that the term Humanities should refer to a methodology and not to traditional contents (Literature, Latin, Greek, etc.). The humanistic method develops curiosity, critical analysis and logical reasoning. It treats each piece of partial knowledge as part of a whole and unique knowledge. It does not matter whether the curriculum is Greek or Physics. It has to do with logic.

During the Renaissance, the term Humanities was used in opposition to the study of Theology: Humanities dealt with human matters, whereas Theology dealt with God, a field in which the principle of faith is dogma. In other words, Humanities as a concept is opposed to Divinity and not to Inhumanity. The separation between Science and Humanities is quite recent. The word "Philosophy" was used for a long time (see, e.g., Descartes or Hume) for what we now consider scientific knowledge, that is, Physics and Natural Science. So, "Physics" was called "Natural Philosophy" and the expression "Grammatical Philosophy" meant "Scientific Grammar", as N. Chomsky notes (1988). During the last decades of the nineteenth century Science and Literature began to be considered different branches of knowledge, as a consequence of the large quantity of concepts acquired.

Humanistic education places reason above all, regardless of the subject involved. This idea helps us to set the boundaries of the discussion, and it contains the most valuable dimension of topics such as Language Industry.

In summary, Language Industry deals with both humanistic and scientific subjects, which have been separated only in the most recent scientific tradition. This bond is linked to our cultural tradition, which advocates the essential unity of human knowledge. In this sense, Language Industry is a real humanistic discipline. The future of Humanities, and the present of Humanities, must be seen in the original sense of this word.

Language Industry is by definition the conjunction of disciplines traditionally considered separate in Arts and Science. It is a humanistic teaching (i.e. one that refers to knowledge related to human beings) and therefore it helps us to understand knowledge as a whole. It was thus considered in the Enlightenment.

Linguistic Norm and Style

Two aspects of General Linguistics (and, at another level, of the particular grammar of each language) that I consider relevant in the training of translators and interpreters are norm and style. There is an enormous bibliography on them, with very different approaches to the two areas, and this may be confusing for the reader. First I will define these two terms; afterwards I will discuss how computerized tools may help in translation and writing; finally I will refer to the concepts of norm and style inherent in the computer programmes I discuss.

The translator needs norms: he/she wants to know if a certain word is correctly written, if a syntactic structure is possible, if there is a precise meaning to a precise form. Norms are, in the first place, all those rules that must be followed to use a language in a correct way. Therefore the concept of norm implies that a language may be subjected to a wrong use: not only does wrong linguistic usage exist but also it is frequent, and qualified writers (i.e. translators) have to pay attention so as to avoid it. The existence of a norm supposes that there are different linguistic forms that compete among themselves when speakers and writers construct their messages.

M. Alvar (1981) established that definitions of norm fall into two major groups. The first associates norm with an ideal of correct usage, taken from literary language, from the linguistic uses of the dominant classes (social prestige) and from the linguistic uses of the most educated (idiomatic correctness).

The second one identifies norm with regular usage, that is, how regular speakers use the language. This identification conveys the idea of uniformity. M. Alvar considers norm a social fact and defends the plurality of norms in every language:

"If there is a 'correct' and 'unitary' norm, that is because others can be defined as dissenting and solvent. Then a social fact is relevant: the existence of variety according to the belonging of the individual to a certain group" (Alvar, 1981: 38).

Style, for its part, is in principle a characteristic manner (of an individual, a period, a genre, a situation) of speaking and writing. The relationship between norm and style is problematic. The stylistics of deviation considers style as linguistic usage independent of the norm: when a writer employs such usage, he or she is looking for a predetermined effect. The stylistics of deviation relates the concepts of grammaticality and ungrammaticality to those of norm and style, respectively. Sometimes style is studied within the norm: the speaker has different, equally correct options for formulating his/her messages; this is the conception of style as a choice within a set of rules.

M. Alvar has referred to this problem in a literary way, when he talks of language as imprisonment and liberation:

"The code is external to the individual that receives and employs it, but it accepts the imprint everyone writes on it" (Alvar, 1981: 13).

In the classroom I usually solve this problem with a pragmatic and deliberately simple method, without forgetting the didactic issues involved. A language allows for many different linguistic uses; only some of them are standard, correct or more frequent. Usage is flexible enough to allow a range of new possibilities. Thus style appears as the variety allowed by the norm, and the norm bounds the initial variety permitted in every linguistic system, every language. For instance,

(1) * Habían cinco personas en la fiesta

(2) Había cinco personas en la fiesta

(2a) En la fiesta había cinco personas

(2b) Cinco personas, había en la fiesta

(2c) Había en la fiesta cinco personas

The user of the language (the translator) must

know that (1) is not standard and that (2) is standard, and why;

choose among (2a), (2b) and (2c) depending on his/her communicative interests.

To reject (1) and to accept (2) is a fact of norm. To choose among (2a), (2b) and (2c) is a fact of style.

How may computerized tools help us? What concepts of norm and style underlie the existing programmes? The Grammatik program is the one I know best, and I will proceed to talk about it. I will evaluate its benefits for students and its underlying implicit theory.

Grammatik, defined as a checker of grammar and style, implicitly associates grammar with norm. This norm is based on the criterion of correctness. The concept of style that it internalizes is that of the stylistics of choice. This approach studies certain linguistic features (at the semantic, syntactic, morphological and lexical levels) as the main constituents of a definite style. It is not by chance that Grammatik adopts this conception of style, since it allows quantification, that is, the statistical measurement of occurrences of certain linguistic units. These become elements of style.

Enkvist (1985: 139) proposes something similar in order to discover features of style. He counts how many times this feature appears in a text fragment, and he compares it with the occurrences of the feature mentioned in a text defined as stylistically relevant, that is, "normative". So Grammatik establishes a certain number as a maximum limit of occurrences of a linguistic feature in a given text fragment. The user may change the value of the limit.
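The counting procedure just described can be sketched in a few lines of Python. This is only an illustration of the idea, not Grammatik's actual implementation: the function names, the chosen feature (Spanish adverbs ending in -mente) and the sample sentence are assumptions made for the example.

```python
import re

def count_feature(fragment: str, pattern: str) -> int:
    """Count case-insensitive matches of a regular expression in a text fragment."""
    return len(re.findall(pattern, fragment, flags=re.IGNORECASE))

def exceeds_limit(fragment: str, pattern: str, limit: int) -> bool:
    """Return True when the feature occurs more often than the user-set limit."""
    return count_feature(fragment, pattern) > limit

# Hypothetical feature: adverbs ending in -mente (an assumption for illustration).
ADVERB_MENTE = r"\b\w+mente\b"

# Invented sample fragment, not taken from any corpus.
sample = "Claramente, el texto fue escrito rápidamente y, evidentemente, sin revisar."

print(count_feature(sample, ADVERB_MENTE))           # occurrences of the feature
print(exceeds_limit(sample, ADVERB_MENTE, limit=2))  # is the limit of 2 exceeded?
```

Raising or lowering `limit` corresponds to the user changing the maximum value, as the text notes.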

Quantification forces us to be precise, and this is an advantage of the programme as a learning tool: we can characterize the exact linguistic features that define a certain style. However, Enkvist wrote in a later work (1994: 116) that features of style that are impossible to quantify may be found at the semantic and superstructural levels of a text. It is in these features that users perceive the limitations of the programme.

Grammatik has ten writing styles, which correspond to text types. Moreover, it allows the user to choose among three levels of formality (standard, formal and informal). The parallel between Grammatik and traditional rhetorical doctrine is obvious. The rhetorical genres (genus iudiciale, deliberativum and demonstrativum), as well as the medieval arts (ars praedicandi, ars dictandi and ars poetriae, the latter two applied only to written texts), are types of discourse. In this sense these types are perfectly comparable to the ten styles of writing offered by Grammatik.

On the other hand, the three levels of formality of Grammatik remind us of the genera elocutionis, that is, the elocution styles or registers (genus humile, genus medium, genus sublime). These are known as plain style, medium style and sublime style. In the traditional rhetorical system every subject matter (rhetorical genre) corresponds to a style (elocution genre). Different factors may change the equivalences. For example, the characteristic genre of Christian Literature, the ars praedicandi, requires the highest style because of its high tone (the religious dogma). However, since the Middle Ages the genus humile has been preferred in order to facilitate the teaching of religious doctrine. In the same way, every writing style in Grammatik is paired with a level of formality: the technical style with the formal level, the advertising style with the informal, and the general style with the standard one. The user may modify these pairings.
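The pairing of writing styles with formality levels, together with the user's ability to change it, amounts to a small lookup table with overrides. The sketch below is a hypothetical model of that behaviour in Python; it includes only the three pairings named above, and the names `DEFAULT_FORMALITY` and `formality_for` are invented for the example rather than taken from Grammatik.

```python
# Default pairings named in the text. Grammatik's remaining styles are not
# listed in the source, so they are left out rather than guessed.
DEFAULT_FORMALITY = {
    "technical": "formal",
    "advertising": "informal",
    "general": "standard",
}
FORMALITY_LEVELS = ("standard", "formal", "informal")

def formality_for(style, user_overrides=None):
    """Return the formality level for a style; user overrides take precedence."""
    pairings = {**DEFAULT_FORMALITY, **(user_overrides or {})}
    level = pairings[style]  # raises KeyError for a style that is not defined
    if level not in FORMALITY_LEVELS:
        raise ValueError(f"unknown formality level: {level}")
    return level

print(formality_for("technical"))
print(formality_for("technical", {"technical": "standard"}))
```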

Annotated Editions

In the past few years, the most complete and rigorous annotated editions have included an electronic text. In the most elaborate versions, this text constitutes a powerful tool for tracing different paths through the textual framework. The critical edition establishes the text, records the variants of the different witnesses of the same work and clarifies meanings in its notes. It also brings classical works closer to modern readers by providing different levels of reading. The electronic text must faithfully reproduce the definitive version on paper. It allows searches and correspondences of all types, suited to the particular interests of every reader. We do not look to it for somebody else's erudition, but to satisfy our own curiosity and survey the text in new ways, in order to find new prospects in it. So the electronic text becomes a complement to, but never a substitute for, the printed critical edition.

Don Quijote de la Mancha, published by the Instituto Cervantes. The textual analysis programme of this edition comes on a CD-ROM that comprises the complete text of Don Quijote de la Mancha in the aforementioned edition, directed by Francisco Rico, along with a programme to handle it. The programme is called Data Base Testuale (DBT). It was written by Eugenio Picchi, of the Istituto di Linguistica Computazionale del Consiglio Nazionale delle Ricerche in Pisa, and adapted to Spanish by Joan Torruella and Carme Planas, of the Philology and Computer Seminar of the Universidad Autónoma de Barcelona. I will comment on the characteristics of this programme and its didactic possibilities.

This CD-ROM contains the Cervantes text, without the footnotes and the complementary notes of the printed edition. The scholarly studies of Cervantes included in the prologue and in the complementary volume have also been excluded. The main menu offers us three options:

1. It lets us read the text, which corresponds line by line to the printed edition, except for end-of-line word division, which is suppressed in the electronic text.

2. The "Interrogation" option opens different possibilities. We can find a word (or part of a word, through the use of wildcard characters). For instance, we may learn that mar and tierra appear in the first part 33 and 146 times, respectively. Once the forms have been located, the programme lets us generate their contexts (concordances) in the desired format. We can also locate the contexts, specifying the part, chapter, page and line. The "Family finding" option is a search for two or more words from the text, combined by logical operators (AND, OR, AND NOT). In this way, with AND we can search for the passages where mar and tierra appear together within a maximum predefined distance (4 times in the first part, separated by a maximum of three words). With the "calculation of co-occurrences" (known in Computational Linguistics as Mutual Information, MI) DBT computes the probability (statistical co-occurrence) that one or more words have of being associated within a text with other words. A special type of statistical co-occurrence is the analysis of prepositions, which allows us to study the prepositions that govern verbs and nouns.

3. The third general option is "Index", which allows the user to obtain indexes of different types (alphabetic, inverse alphabetic, according to the absolute frequency, indicating the location of the forms, etc.) and to make statistical estimations of every structural unit of the text.

Therefore, the searches comprise not only lexical units (words) but also units of every level of analysis, defined through rules by the user. The results may be printed or saved in a file.
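To make these search types concrete, here is a minimal Python sketch of three of them: a keyword-in-context concordance, an AND search within a maximum distance, and a mutual information score. It does not reproduce DBT; the function names are invented, the sample sentence is made up rather than quoted from the Quijote, and the score uses the standard pointwise mutual information formula, which is an assumption about how such a calculation is typically done, not a description of DBT's internals.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lower-case the text and split it into word tokens."""
    return re.findall(r"\w+", text.lower())

def concordance(tokens, word, width=2):
    """Return one context line per hit, with `width` tokens on either side."""
    lines = []
    for i, tok in enumerate(tokens):
        if tok == word:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{tok}] {right}".strip())
    return lines

def near(tokens, a, b, max_gap=3):
    """Count occurrences of `a` with `b` at most `max_gap` words away."""
    pos_b = [i for i, t in enumerate(tokens) if t == b]
    return sum(
        1
        for i, t in enumerate(tokens)
        if t == a and any(0 < abs(i - j) <= max_gap + 1 for j in pos_b)
    )

def pmi(tokens, a, b, max_gap=3):
    """Pointwise mutual information (base 2) of `a` and `b` within the window."""
    n = len(tokens)
    joint = near(tokens, a, b, max_gap)
    if joint == 0:
        return float("-inf")  # the two words never co-occur
    counts = Counter(tokens)
    return math.log2((joint / n) / ((counts[a] / n) * (counts[b] / n)))

# Invented sample sentence, used only to exercise the functions.
tokens = tokenize("Por mar y por tierra anduvo el caballero y vio el mar desde la tierra")

for line in concordance(tokens, "mar"):
    print(line)
print(near(tokens, "mar", "tierra"))
print(pmi(tokens, "mar", "tierra"))
```

A positive score indicates that the two words appear together more often than their individual frequencies would predict; on a text as short as this sample the value has no statistical weight.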

The main advantage of the programme is its flexibility. It adapts to the personal interests of every student, increasing his/her autonomy and allowing the encounter with the text to be personalized. In a way the interest of the programme begins where the traditional critical edition ends: the programme allows us to corroborate intuitions, to verify hypotheses and to check the most interesting items.

My opinion of the programme is very positive. Nevertheless, students face a problem common to all technological tools: as users they do not profit from all the possibilities of the programme. Sometimes they lack clear objectives even though they are aware of the functions of the programme. This problem leads us to the existence of two types of intelligence (J.A. Marina, 1993: pp. 15-28): one we can call "computational", which receives information, works on it and produces an answer, and one that we can simply call "human". This human intelligence, besides being computational, is autonomous: it is based on liberty, and it creates information and invents its own goals. This programme is irrelevant for the development of human intelligence, but it opens up all its possibilities when it is governed by human intelligence. As philosophers reminded us at the Twentieth World Congress of Philosophy, held in Boston last August, all these procedures of calculation and data collection must respond to clarity of mind. This is the common reflection on all technology: true education lies in the intellectual comprehension of the environment. Then that understanding may extend to the technical instruments. Technology is the result of clarity of mind, and not its substitute.

New Tools of Research in Linguistics and Literature

In this section I will comment on the characteristics and applications of two linguistic corpora supported by the Real Academia Española, a project which has not yet been finished: the Reference Corpus of Current Spanish (CREA) and the Diachronic Corpus of Spanish (CORDE).

CREA offers researchers a representative sample of current standard Spanish. Its modular structure allows great flexibility in searches, which can be restricted by geographical, generic, chronological and thematic criteria.
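A search restricted by such criteria can be pictured as filtering documents on their metadata before counting matches. The Python sketch below is a hypothetical model and not the CREA query interface: the `Document` fields, the `search` function and the three sample documents are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Document:
    """A corpus text with the metadata used to restrict searches (hypothetical schema)."""
    title: str
    country: str
    year: int
    medium: str  # "written" or "oral"
    area: str    # thematic area
    text: str

def search(docs, phrase, *, country=None, years=None, medium=None, area=None):
    """Return (title, hits) for documents that pass every filter and contain the phrase."""
    results = []
    for doc in docs:
        if country is not None and doc.country != country:
            continue
        if years is not None and not (years[0] <= doc.year <= years[1]):
            continue
        if medium is not None and doc.medium != medium:
            continue
        if area is not None and doc.area != area:
            continue
        hits = doc.text.lower().count(phrase.lower())
        if hits:
            results.append((doc.title, hits))
    return results

# Invented sample documents.
docs = [
    Document("doc-es", "Spain", 1995, "written", "Fiction",
             "Voy a por pan y luego voy a por agua."),
    Document("doc-mx", "Mexico", 1996, "written", "Fiction", "Voy por pan."),
    Document("doc-es-oral", "Spain", 1980, "oral", "Health",
             "Ahora voy a por el médico."),
]

print(search(docs, "voy a por"))
print(search(docs, "voy a por", medium="written"))
print(search(docs, "voy a por", years=(1975, 1985)))
```

Each keyword argument narrows the set of documents independently, which is one simple way to obtain the flexibility described above.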

In summary, CREA includes texts from the last 25 years (1975-1999). Half of these texts come from America and half from Spain; however, Central and South American linguists criticized this proportion at the presentation of the corpus held in Madrid last March. As for genre, written texts make up 90% of the texts and oral texts 10%. The contents are arranged in thematic areas: Science and Technology; Social Sciences; Religion and Thought; Politics and Economics; Arts; Leisure and Ordinary Life; Health; Fiction. The final size of the entire corpus is estimated at 125 million words.

The encoding language chosen for this corpus is SGML (Standard Generalized Markup Language). This is a common code for electronic texts of the nineties (and in the near future it will become XML, a support of the new Internet standard). The codification of written texts occurs at two levels. At the first level the text receives formal marks, usually automatically. For instance, the heading of the text records bibliographical information and documentation about the electronic text. In the body of the text basic structural marks are indicated (paragraph and page number) and basic intratextual marks are registered (notes, corrections, mistakes, formulas and tables, etc.). At the second level further intratextual information is added (i.e., internal structure of the text, quotations, direct speech, metalanguage, foreign words, etc.). The scheme of codification of oral texts is comparable in its complexity to the second level of written ones. The marks are noted during transcription. There are structural marks that indicate, for instance, turns in the speech, as well as non-structural marks (overlapping, hesitation, anacoluthon, etc.).
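The following Python sketch illustrates what two levels of marks might look like and how marked spans can be retrieved. It is an illustration only: the tag names (`text`, `bibl`, `source`, `year`, `para`, `quote`, `foreign`) are hypothetical and do not reproduce the corpus's actual document type definition, and Python's `html.parser` is used merely as a lenient tag reader, not as a validating SGML parser.

```python
from html.parser import HTMLParser

# Invented sample with hypothetical tag names: a heading with bibliographical
# data, structural marks (para) and second-level marks (quote, foreign).
SAMPLE = (
    '<text><bibl><source>Ejemplo</source><year>1998</year></bibl>\n'
    '<para n="1">Dijo <quote>hasta luego</quote> y pidió un '
    '<foreign>sandwich</foreign>.</para>\n'
    '<para n="2">Nada más.</para></text>'
)

class MarkCollector(HTMLParser):
    """Collect the character data found directly inside each kind of tag."""

    def __init__(self):
        super().__init__()
        self._stack = []   # currently open tags, innermost last
        self.by_tag = {}   # tag name -> list of text chunks

    def handle_starttag(self, tag, attrs):
        self._stack.append(tag)

    def handle_endtag(self, tag):
        if self._stack and self._stack[-1] == tag:
            self._stack.pop()

    def handle_data(self, data):
        chunk = data.strip()
        if self._stack and chunk:
            self.by_tag.setdefault(self._stack[-1], []).append(chunk)

collector = MarkCollector()
collector.feed(SAMPLE)
collector.close()

print(collector.by_tag["foreign"])  # spans marked as foreign words
print(collector.by_tag["quote"])    # spans marked as quotations
```

Once second-level marks of this kind exist, a search can be limited to, say, foreign words or direct speech, which is the sort of query that such an annotation scheme makes possible.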

At present, linguistic annotation has been carried out after the second level of codification (it affects only one million words of the corpus). In the future, linguistic annotation will be carried out after the first level of codification, and linguistic information will be available at the second.

CORDE will also consist of 125 million words in its final version. These words will cover the history of Spanish from its beginnings until 1975, where CORDE will link to CREA. CORDE is a corpus of written texts which includes, as CREA does, complete texts. It also uses SGML as its encoding language. The internal structure of the corpus takes into account criteria of three types: chronological (the texts are gathered in three periods, the Middle Ages, the Golden Age and the Contemporary Age, which are subdivided into shorter units of time); geographical (the texts come from all parts of the world where Spanish is spoken or has been spoken; the peninsular texts comprise 74% of the corpus, as opposed to the 26% of Spanish texts from the rest of the world); and generic (the texts are divided into two large groups, fiction, 44%, and non-fiction, 56%, with further subdivisions).
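As a rough worked example, the percentages given for both corpora can be turned into approximate word counts. The short Python sketch below assumes, for the sake of illustration, that each percentage applies to the projected 125 million words; the source states the CREA shares as proportions of texts, so the resulting figures are indicative only, and the function name `allocate` is invented.

```python
TOTAL_WORDS = 125_000_000  # projected final size of each corpus

def allocate(total, shares):
    """Split a word total according to percentage shares that must sum to 100."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must sum to 100")
    return {name: total * pct // 100 for name, pct in shares.items()}

crea_by_region = allocate(TOTAL_WORDS, {"Spain": 50, "America": 50})
crea_by_medium = allocate(TOTAL_WORDS, {"written": 90, "oral": 10})
corde_by_region = allocate(TOTAL_WORDS, {"peninsular": 74, "rest of world": 26})
corde_by_genre = allocate(TOTAL_WORDS, {"fiction": 44, "non-fiction": 56})

for label, table in [
    ("CREA by region", crea_by_region),
    ("CREA by medium", crea_by_medium),
    ("CORDE by region", corde_by_region),
    ("CORDE by genre", corde_by_genre),
]:
    print(label, table)
```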

Although CREA and CORDE are complementary, and the corpus of the former will increase the corpus of the latter in time, there are differences between the two corpora. For instance, in contrast to CORDE, CREA does not include any text in verse. This explains the different marking systems of the corpora, which in CORDE is broader and more precise.

These corpora allow us to check the behaviour of linguistic structures considered incorrect or anomalous. They also establish where such structures appear and how strong they are in contrast to standard sequences. We can also learn which possibilities of the system have already been tried and have been unsuccessful, perhaps for reasons external to linguistic mechanisms. The questions of norm and style discussed in the section on linguistic norm and style are examined here from a different perspective. Humboldt believed that language constantly changed; following this line of thought, a representative corpus lets us study the alternatives that the system offers to solve the same communicative problem, when they arise, and where they have more vitality.

Let us see some examples of searches in CREA:

The sequence of a verb of movement followed by the prepositions a and por (voy a por pan) is not normative in Spanish. The Real Academia (Esbozo..., p. 436) notes that it began to spread in the popular speech of Spain during the second half of the nineteenth century, and that it developed in Spain but not in America. A CREA search is illuminating: voy a por appears 72 times in 19 different documents, all of them from Spain, and never in a text from an American country. The corpus corroborates the Academy's assessment.

The sequence of prepositions por contra is considered a gallicism in current Spanish, i.e., a combination that imitates a similar French sequence. It is advisable to avoid it and to use instead the correct expression por el contrario. CREA allows us to expand the description of the phenomenon with new data: this construction is much more frequent in Spain (510 occurrences) than in America (10 occurrences). Moreover, it is especially frequent in the thematic area of "journalism" (400 occ.). The examples in American texts are very recent (they have been documented since 1994). So, the phenomenon has extended from French to European Spanish, mainly through the language of the press. Recently this sequence has been found in American Spanish (6 of the 10 cases in American texts appear in the press; all of them occur in Venezuela and refer to Spanish sports news, which may indicate that this evidence is doubtful).

The searches concerning lexical units are also conclusive. It used to be thought that forms with the root implement- (-ar, -to, -ación, the last one not recorded in the dictionary of the RAE, but found in recent dictionaries of usage, for instance in the Gran Diccionario de la Lengua Española of Larousse) have more vitality in the Spanish of America than in that of Spain. The manuals of style used by Spanish media usually censure this anglicism (e.g., the Manual de español urgente of the Agencia EFE). The data derived from CREA are revealing: of the 1211 occurrences of the lexical family implement-, only 63 are in texts from Spain, and they have occurred since 1982. It is the opposite case to por contra, studied above. The form implementación appears in CREA 359 times, and 23 of these belong to texts from Spain (in only 14 different documents, and since 1987).
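The regional contrasts reported above can be restated as percentages with a few lines of Python. This is a sketch under two stated assumptions: that raw counts are roughly comparable across regions because the CREA texts are split evenly between Spain and America, and that the regional figures given for por contra (510 and 10) account for all of its occurrences. The helper name `share` is invented.

```python
def share(part, total):
    """Return `part` as a percentage of `total`, rounded to one decimal place."""
    return round(100 * part / total, 1)

# por contra: 510 occurrences in Spain, 10 in America (total assumed to be 520).
por_contra_spain = share(510, 510 + 10)

# implement-: 1211 occurrences overall, 63 of them in texts from Spain.
implement_spain = share(63, 1211)

# implementación: 359 occurrences overall, 23 of them in texts from Spain.
implementacion_spain = share(23, 359)

print(f"por contra in Spain: {por_contra_spain}%")
print(f"implement- in Spain: {implement_spain}%")
print(f"implementación in Spain: {implementacion_spain}%")
```

The first figure is overwhelmingly European and the other two overwhelmingly American, which is the opposition that the text describes.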

Conclusions

To sum up we may conclude as follows:

  1. Language Industry is a field of knowledge that links technological, scientific and humanistic knowledge, in the traditional sense of the last word. Furthermore, Language Industry becomes a real humanistic discipline, which considers knowledge as a whole, derived from human reasoning.

  2. The programmes that check the correctness and the style of our writing are implicitly or explicitly based on the stylistics of choice. This stylistics defines a certain style on the basis of the repetition of some linguistic features and allows us to quantify the facts of style. In the programme Grammatik we also find a parallel with classical rhetoric. The programme proposes different styles of writing and levels of formality to the user. They are equivalent to the rhetorical genres and to the elocution registers of the traditional rhetorical system.

  3. The modern editions of annotated texts offer new possibilities. The printed text is presented with an electronic text that combines a textual data base with a search programme. The most elaborate programmes let the reader analyse the text in different ways and corroborate any of his/her intuitions. The electronic text must literally follow the printed text, and its function is to complement the traditional critical edition, never to replace it.

  4. These applications never replace the personal initiative of the researcher. They are tools that develop all the possibilities of the text when they are handled by a free individual who creates his/her own goals.

  5. The most recent linguistic corpora are powerful tools that simplify descriptive linguistics. They allow the researcher's judgements and assumptions to be founded on empirical data. Rigour is crucial in the design, configuration and coding of the corpus in order to obtain reliable data. These linguistic corpora let us approach classical problems of research, such as norm and style, in new ways.

  6. The training of researchers in linguistics and literature must bring together the classical contents of the curriculum with a deep knowledge of these new applications, not only as users but also as collaborators in their design and theoretical foundations.

Bibliography

Agencia Efe (1991/8): Manual de español urgente, Madrid, Cátedra.

Alcaraz, E. and M.A. Martínez (1997): Diccionario de lingüística moderna, Barcelona, Ariel.

Alvar, M. (1982): La lengua como libertad, Madrid, Ed. Cultura Hispánica.

Azaustre, A. and J. Casas (1997): Manual de retórica española, Barcelona, Ariel.

Cervantes, M. (1998): Don Quijote de La Mancha, ed. dirigida por F. Rico, Barcelona, Instituto Cervantes-Crítica.

Chomsky, N. (1988): El lenguaje y los problemas del conocimiento, Madrid, Visor.

Enkvist, N. (1985): "Estilística, lingüística del texto y composición" en Bernárdez, E. (comp.): Lingüística del texto, Madrid, Arco Libros, 1987, pp. 131-150.

Enkvist, N. (1994): "The epistemic gap in the linguistic stylistics", in Winter, W. (ed.): On Languages and Language, Berlin, Mouton de Gruyter, pp. 109-126.

Gran Diccionario de la Lengua Española (1996), prólogo de F. Rico, Barcelona, Larousse Planeta.

Jovellanos, G.M. (1978/3): Obras en prosa. Edición de José Caso González, Madrid, Castalia.

Marina, J.A. (1993): Teoría de la inteligencia creadora, Barcelona, Anagrama.

Real Academia Española (1992/21): Diccionario de la lengua española, Madrid, Espasa Calpe.

Sager, J.C. (1994): Language Engineering and Translation: Consequences of Automation, Amsterdam, John Benjamins.

Savater, F. (1997): El valor de educar, Barcelona, Ariel.


About the Author

Luis Guerra
Departamento de Filología Española
Universidad Europea de Madrid
Villaviciosa de Odón
28670 Madrid, Spain
Email: luis.guerra@esp.fil.uem.es


Copyright © Luis Guerra, 1998. For uses other than personal research or study, as permitted under the Copyright Laws of your country, permission must be negotiated with the author. Any further publication permitted by the author must include full acknowledgement of first publication in ultiBASE (http://ultibase.rmit.edu.au). Please contact the Editor of ultiBASE for assistance with acknowledgement of subsequent publication.
Copyright © 2001 Faculty of Education Language and Community Services
Document URL: http://ultibase.rmit.edu.au/Articles/dec98/guerra1.htm
Last Updated: 08-December-1998 by Marita Mueller