Research in language and literature: Old problems, new solutions
Author: Luis Guerra, Universidad Europea de Madrid
Keywords: Language Industry, linguistics, literature, humanities, research tools, information technology, textual database, search programme, teaching and learning, communication.
Article style and source: Paper presented at the Conference in Bergen, The Future of Humanities in the Digital Age, September 25-28, 1998.
Abstract
The principal ideas of the article are the following:
Humanities and Language Industry
The term "Language Industry" (Industrias de la lengua, Les industries de la langue) can be applied to several different activities, which should be defined. On the one hand, Language Industry refers to the industry that supports communication; in this sense, printing is undoubtedly the first major step in industrialisation. The typewriter, the fax, e-mail and the word processor are tools that appear in this process of industrialisation. Language Industry also covers the activities (products and processes) that manipulate language. It is an industry primarily linked to computers. According to Sager (1994), if the linguistic point of view is stressed, the field involved is Computational Linguistics; if the applications of computer technology are the main objective, the result is Natural Language Processing. On the other hand, Language Industry is also an industry that produces language: the automatic generation of abstracts, automatic translation, the production of indexing and documentation languages, etc.
These different perspectives coincide in reconciling distant and even opposing subjects. Linguistics and Literature, which are the basis of the Humanities as they are generally known, thus become closely linked to Telecommunication Engineering and Computing. I am interested in the approach that Language Industry brings about because it connects with our scientific tradition. The unity of thought and the close link among the different branches of knowledge were upheld during the Enlightenment. For instance, Gaspar Melchor de Jovellanos (1744-1810), one of the most representative figures of Enlightenment Spain, founded the Royal Asturian Institute, a technical institute that trained naval and mining technicians. Jovellanos's most original idea was to consider the humanities the main subject for its students.
Thus, in his Oration about the Necessary Unification between Literary Study and Sciences, he writes:
"This meeting so long desired and never well established in our imperfect educational methods may seem odd to some, impossible to others and perhaps to you [students from the Royal Asturian Institute] useless and unprofitable" (Jovellanos, 1978/3: 207).
In my opinion, these comments remain valid today. Jovellanos believed that the scientific disciplines arrive at ideas and hold them, while Linguistics and Literature describe them. Knowledge of the world around us is achieved through science; Linguistics and Literature allow us to communicate and spread it, giving it new forms in order to grasp all its nuances. Jovellanos also writes in the Oration about "the intimate union which all human knowledge has, whose intuition and comprehension must be the unique objective of our study, because without it all knowledge is in vain" (Jovellanos, 1978/3: 210).
From my point of view, one of the most interesting ideas that Language Industry offers to scholars is that of reuniting the different branches of specific and specialized knowledge without forgetting their common core. This reconnection of various fields of study, which Language Industry makes visible, leads me to redefine the term "Humanities". The redefinition is closer to the original meaning of the term and, moreover, is linked to the purpose of this congress. For both reasons, I would like to explain it now. F. Savater (1997) thinks that "Humanities" should be used to refer to a methodology and not to traditional contents (Literature, Latin, Greek, etc.). The humanistic method develops curiosity, critical analysis and logical reasoning. It treats each piece of partial knowledge as part of a single, unified body of knowledge. It does not matter whether the curriculum taught is Greek or Physics: what matters is logic.
During the Renaissance, Humanities was used in opposition to Theology: Humanities referred to human studies, whereas Theology dealt with God, where the principle of faith is a dogma. In other words, Humanities as a concept is opposed to Divinity and not to Inhumanity. The separation between Science and Humanities is quite recent. The word "Philosophy" was long used (see, e.g., Descartes or Hume) for what we now consider scientific knowledge, that is, Physics and Natural Science. Thus "Physics" was called "Natural Philosophy", and the expression "Grammatical Philosophy" meant "Scientific Grammar", as N. Chomsky notes (1988). During the last decades of the nineteenth century, Science and Literature began to be considered different branches of knowledge as a consequence of the large quantity of concepts acquired. Humanistic education places reason above all, regardless of the subject involved. This idea helps us to frame the discussion, and it captures the most valuable dimension of topics such as Language Industry.
In summary, Language Industry deals with both humanistic and scientific subjects, which have been separated only in the most recent scientific tradition. This bond is rooted in our cultural tradition, which advocates the essential unity of human knowledge. In this sense, Language Industry is a genuinely humanistic discipline. The future of the Humanities, like their present, must be seen in the original sense of the word. Language Industry is by definition the conjunction of disciplines traditionally considered separate, in Arts and Science. It is a humanistic teaching (i.e. one that refers to knowledge related to human beings) and therefore it helps us to understand knowledge as a whole, as it was understood in the Enlightenment.
Linguistic Norm and Style
Two aspects of General Linguistics (and, at another level, of the particular grammar of each language) that I consider relevant to the training of translators and interpreters are norm and style. There is an enormous bibliography on them, with very different approaches to the two areas, and this may be confusing for the reader. I will first define the two terms and then discuss how computerized tools may help in translation and writing; finally, I will refer to the concepts of norm and style inherent in the computer programmes I discuss.
The translator needs norms: he/she wants to know whether a certain word is correctly written, whether a syntactic structure is possible, whether a precise form has a precise meaning. Norms are, in the first place, all those rules that must be followed in order to use a language correctly. The concept of norm therefore implies that a language may be used wrongly: not only does wrong linguistic usage exist, it is also frequent, and qualified writers (such as translators) have to take care to avoid it. The existence of a norm presupposes that there are different linguistic forms that compete among themselves when speakers and writers construct their messages.
M. Alvar (1981) established that definitions of norm fall into two major groups. The first associates norm with an ideal of correct usage, taken from literary language, from the linguistic uses of the dominant classes (social prestige) and from the linguistic uses of the most educated (idiomatic correctness). The second identifies norm with regular usage, that is, how ordinary speakers use the language. This identification conveys the idea of uniformity. M. Alvar considers norm a social fact and defends the plurality of norms in every language: "If there is a 'correct' and 'unitary' norm, that is because others can be defined as dissenting and solvent.
Then a social fact is relevant: the existence of variety according to the belonging of the individual to a certain group" (Alvar, 1981: 38).
Style, for its part, is in principle a characteristic manner (of an individual, a period, a genre, a situation) of speaking and writing. The relationship between norm and style is problematic. The stylistics of deviation considers style as linguistic usage independent of the norm: when a writer employs such usage, he/she is seeking a predetermined effect. The stylistics of deviation relates the concepts of grammaticality and ungrammaticality to those of norm and style, respectively. Sometimes style is studied within the norm: the speaker has different, equally correct options for formulating his/her messages; this is the conception of style as choice within a set of rules. M. Alvar has referred to this problem in a literary way, when he speaks of language as imprisonment and liberation: "The code is external to the individual that receives and employs it, but it accepts the imprint everyone writes on it" (Alvar, 1981: 13).
In the classroom I usually solve this problem with a pragmatic and deliberately simple method, without forgetting the didactic issues involved. A language allows for many different linguistic uses; only some of them are standard, correct or more frequent. Usage is flexible enough to leave room for a range of new possibilities. Thus style appears as the variety allowed by the norm, and the norm limits the initial variety permitted in every linguistic system, every language. For instance:
(1) * Habían cinco personas en la fiesta
(2) Había cinco personas en la fiesta
(2a) En la fiesta había cinco personas
(2b) Cinco personas, había en la fiesta
(2c) Había en la fiesta cinco personas
The user of the language (the translator) must know that 1 is not standard, that 2 is standard, and why. He/she can then choose among 2a, 2b or 2c depending on his/her communicative interests.
To reject 1 and to accept 2 is a matter of norm. To choose among 2a, 2b or 2c is a matter of style.
How may computerized tools help us? What concepts of norm and style underlie the existing programmes? The Grammatik program is the one I know best, so I will discuss it here, evaluating its benefits for students and its implicit underlying theory. Grammatik, defined as a checker of grammar and style, implicitly associates grammar with norm, and this norm is based on the criterion of correctness. The concept of style that it adopts is that of the stylistics of choice, which studies certain linguistic features (at the semantic, syntactic, morphological and lexical levels) as the main constituents of a given style. It is no accident that Grammatik adopts this conception of style, since it allows quantification, that is, the statistical measurement of the occurrences of certain linguistic units, which thereby become elements of style. Enkvist (1985: 139) proposes something similar in order to discover features of style: he counts how many times a feature appears in a text fragment and compares the result with the occurrences of the same feature in a text defined as stylistically relevant, that is, "normative". Grammatik thus sets a certain number as the maximum limit of occurrences of a linguistic feature in a given text fragment, and the user may change the value of this limit.
Quantification forces us to be precise, and this is an advantage of the programme as a learning tool: we can characterize the exact linguistic features that define a certain style. However, Enkvist wrote in a later work (1994: 116) that features of style that are impossible to quantify may be found at the semantic and superstructural levels of a text. It is in these features that users perceive the shortcomings of the programme. Grammatik has ten writing styles, which correspond to text types.
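The quantitative idea described above (count a feature in a fragment, compare it with a user-adjustable limit or with a "normative" text) can be sketched as follows. This is a minimal illustration only and does not reproduce Grammatik's internals: the function names, the choice of "feature" (membership in a small word list) and the sample sentences are assumptions made for the example.

```python
import re


def feature_rate(text, feature_words, per=100):
    """Occurrences of the feature per `per` running words of the fragment."""
    tokens = re.findall(r"\w+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in feature_words)
    return hits * per / len(tokens)


def exceeds_limit(text, feature_words, limit, per=100):
    """True when the fragment uses the feature more often than the limit allows.

    `limit` plays the role of the user-adjustable maximum (hypothetical name).
    """
    return feature_rate(text, feature_words, per) > limit


def deviation_from_norm(text, norm_text, feature_words, per=100):
    """Difference between the fragment's rate and that of a 'normative' text."""
    return feature_rate(text, feature_words, per) - feature_rate(
        norm_text, feature_words, per
    )


if __name__ == "__main__":
    intensifiers = {"very", "really"}  # toy feature, chosen for illustration
    fragment = "This is very very good and really nice"
    norm = "This is good and nice and very clear"
    print(feature_rate(fragment, intensifiers))             # 37.5
    print(exceeds_limit(fragment, intensifiers, limit=5))   # True
    print(deviation_from_norm(fragment, norm, intensifiers))  # 25.0
```

The sketch also makes Enkvist's later reservation concrete: only features that can be reduced to countable units fit this scheme.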
Moreover, Grammatik allows the user to choose among three levels of formality (standard, formal and informal). The parallel between Grammatik and traditional rhetorical doctrine is obvious. The rhetorical genres (genus iudiciale, deliberativum and demonstrativum), and likewise the medieval arts (ars praedicandi, ars dictandi and ars poetriae, the latter two applying only to written texts), are types of discourse. In this sense they are perfectly comparable to the ten styles of writing offered by Grammatik. On the other hand, the three levels of formality of Grammatik remind us of the genera elocutionis, that is, the styles or registers of elocution (genus humile, genus medium, genus sublime), known as the plain, middle and sublime styles. In the traditional rhetorical system every subject matter (rhetorical genre) corresponds to a style (genre of elocution). Various factors may change these equivalences. For example, the characteristic genre of Christian literature, the ars praedicandi, requires the highest style because of its elevated subject (religious dogma). However, since the Middle Ages the genus humile has been preferred in order to facilitate the teaching of religious doctrine. In the same way, every writing style in Grammatik is always paired with a level of formality: the technical style with the formal level, the advertising style with the informal, and the general style with the standard, and the user may modify these pairings.
Annotated Editions
In the past few years, the most complete and rigorous annotated editions have included an electronic text. In the more elaborate versions, this text constitutes a powerful tool for tracing different paths through the textual framework. The critical edition establishes the text, records the variants of the different witnesses of the same work and clarifies meanings in its notes.
It also brings classical works closer to modern readers by providing different levels of reading. The electronic text must faithfully reproduce the definitive version on paper. It allows searches and cross-references of all kinds, suited to the particular interests of each reader. We do not look in it for somebody else's erudition; we seek to satisfy our own curiosity and to survey the text in new ways, in order to find new perspectives in it. The electronic text thus becomes a complement to, but never a substitute for, the printed critical edition.
This CD-ROM contains the Cervantes text, without the footnotes and the
complementary notes of the printed edition. The scholarly studies of Cervantes
included in the prologue and in the complementary volume have also been
excluded. The main menu offers us three options:
1. It lets us read the text, which corresponds line by line to the printed
edition, except for the end-of-line hyphenation of words, which is suppressed
in the electronic text.
2. The "Interrogation" option opens up different possibilities. We can find
a word (or part of a word, through the use of wildcard characters). For instance,
we can learn that mar and tierra appear in the first part
33 and 146 times, respectively. Once the forms have been located, the programme
lets us generate their contexts (concordances) in the desired format. We
can also locate the contexts, specifying the part, chapter, page and line.
The option "Family finding" is a search for two or more
words in the text, combined by logical operators (AND, OR, AND NOT).
Thus, with AND we can search for the passages where mar and
tierra appear together within a maximum predefined distance (4
times in the first part, separated by a maximum of three words). With the
"calculation of co-occurrences" (known in Computational Linguistics as Mutual
Information, MI) DBT computes the probability (statistical co-occurrence)
that one or more words have of being associated within a text with other
words. A special type of statistical co-occurrence is the analysis of
prepositions, which allows us to study the prepositions governed by verbs
and nouns.
3. The third general option is "Index", which allows the user to obtain
indexes of different types (alphabetic, inverse alphabetic, by
absolute frequency, indicating the location of the forms, etc.)
and to make statistical estimates for every structural unit of the text.
Therefore, the searches comprise not only lexical units (words) but also
units of every level of analysis, defined through rules by the user. The
results may be printed or saved in a file.
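The three kinds of query listed above (wildcard search, concordances, proximity search with AND, co-occurrence statistics and indexes) can be sketched in a few lines of Python. This is only an illustration on a toy sentence, not the DBT implementation: the function names, the wildcard syntax (Python's fnmatch patterns), the sample text and the simple estimate of mutual information are all assumptions made for the example.

```python
import math
import re
from fnmatch import fnmatchcase


def tokenize(text):
    """Lower-cased word forms, punctuation discarded."""
    return re.findall(r"\w+", text.lower())


def find_forms(tokens, pattern):
    """Positions of the forms matching a wildcard pattern such as 'mar*'."""
    return [i for i, t in enumerate(tokens) if fnmatchcase(t, pattern)]


def concordance(tokens, pattern, width=2):
    """Keyword-in-context lines with `width` words on either side of each hit."""
    lines = []
    for i in find_forms(tokens, pattern):
        left = " ".join(tokens[max(0, i - width):i])
        right = " ".join(tokens[i + 1:i + 1 + width])
        lines.append(f"{left} [{tokens[i]}] {right}".strip())
    return lines


def near(tokens, a, b, max_gap=3):
    """AND search: position pairs where a and b are at most max_gap words apart."""
    return [
        (i, j)
        for i in find_forms(tokens, a)
        for j in find_forms(tokens, b)
        if i != j and abs(i - j) - 1 <= max_gap
    ]


def mutual_information(tokens, a, b, max_gap=3):
    """One simple estimate of pointwise MI: log2(P(a,b) / (P(a) * P(b))).

    Probabilities are raw counts divided by the number of tokens; returns
    None when the two forms never co-occur within the window.
    """
    n = len(tokens)
    pairs = len(near(tokens, a, b, max_gap))
    count_a, count_b = len(find_forms(tokens, a)), len(find_forms(tokens, b))
    if not pairs:
        return None
    return math.log2((pairs / n) / ((count_a / n) * (count_b / n)))


def inverse_index(tokens):
    """Inverse alphabetic index: distinct forms sorted by their reversed spelling."""
    return sorted(set(tokens), key=lambda form: form[::-1])


if __name__ == "__main__":
    # Invented sample text, not taken from the edition.
    sample = "el mar y la tierra, la tierra y el mar, mar adentro, marinero"
    toks = tokenize(sample)
    print(find_forms(toks, "mar*"))
    print(concordance(toks, "tierra"))
    print(near(toks, "mar", "tierra", max_gap=3))
    print(mutual_information(toks, "mar", "tierra"))
    print(inverse_index(toks))
```

The point of the sketch is the one made in the text: the results depend entirely on the parameters the reader chooses (pattern, window, distance), so the tool answers questions but does not pose them.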
The main advantage of the programme is its flexibility. It adapts to
the personal interests of every student, increasing his/her autonomy
and allowing each reader's engagement with the text to be personalized. In a
way, the interest of the programme begins where the traditional critical
edition ends: the programme makes it possible to corroborate intuitions,
verify hypotheses and check the most interesting items.
My opinion of the programme is very positive. Nevertheless,
students face a problem common to all technological tools: as users they
do not take advantage of all the possibilities of the programme. Sometimes
they lack clear objectives even though they are aware of its functions.
This problem leads us to the existence of two types of intelligence (J.A.
Marina, 1993: 15-28): one we can call "computational", which receives
information, works on it and produces an answer, and one that we can simply
call "human". This human intelligence, besides being computational, is
autonomous: it is based on liberty, it creates information and it invents
its own goals. The programme is irrelevant to the development of human
intelligence, but it opens up all its possibilities when it is
governed by human intelligence. As philosophers reminded us at
the Twentieth World Congress of Philosophy, held in Boston last August,
all these procedures of calculation and data collection must respond to
clarity of mind. This is the common reflection on all technology:
genuine education lies in the intellectual comprehension of the environment.
That understanding may then extend to the technical instruments. Technology
is the result of clarity of mind, and not its substitute.
In this section I will comment on the characteristics and applications
of two linguistic corpora supported by the Real Academia Española,
a project which has not yet been completed: the Corpus of Reference of
Current Spanish (CREA) and the Spanish Diachronic Corpus (CORDE).
The CREA offers researchers a representative sample of current standard
Spanish. Its modular structure allows great flexibility in searches,
which can be restricted by geographical, generic, chronological and thematic
criteria. In summary, the CREA includes texts from the last 25 years (1975-1999).
Half of these texts come from America and half from Spain; however, Central
and South American linguists criticized this proportion at the presentation
of the corpus held in Madrid last March. As regards genre, written texts make
up 90% of the corpus and oral texts 10%. The contents are arranged
in thematic areas: Science and Technology; Social Sciences; Religion and
Thought; Politics and Economics; Arts; Leisure and Ordinary Life; Health;
Fiction. The final size of the entire corpus is estimated at 125 million
words.
The markup language chosen for this corpus is SGML (Standard
Generalized Markup Language). This is a common encoding for electronic texts
of the nineties (in the near future it is expected to give way to XML, the
basis of the new Internet standard). The encoding of written texts takes place
at two levels. At the first, the text receives formal markup, usually
automatically. For instance, the header of the text holds
bibliographical information and documentation about the electronic text.
In the body of the text, basic structural marks are indicated (paragraph
and page number) and basic intratextual marks are recorded (notes, corrections,
mistakes, formulas and tables, etc.). At the second level, further intratextual
information is added (i.e., internal structure of the text, quotations, direct
speech, metalanguage, foreign words, etc.). The encoding scheme
for oral texts is comparable in complexity to the second level for written
ones. The marks are added during transcription. There are structural
marks that indicate, for instance, turns in the speech, as well as non-structural
marks (overlapping, hesitation, anacoluthon, etc.).
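As a rough illustration of what first-level, automatically generated markup might look like, the following sketch wraps a plain text in a header (bibliographical information) and a body (page and paragraph marks). The element names used here (text, header, title, author, year, body, page, p) are invented for the example and are not the tag set actually used in CREA or CORDE; the function name and the sample data are likewise hypothetical.

```python
from html import escape


def first_level_markup(title, author, year, pages):
    """Wrap plain text in SGML-style tags.

    `pages` is a list of (page_number, [paragraph, ...]) tuples.
    All element names are hypothetical, chosen only for illustration.
    """
    out = [
        "<text>",
        "<header>",
        f"<title>{escape(title)}</title>",
        f"<author>{escape(author)}</author>",
        f"<year>{year}</year>",
        "</header>",
        "<body>",
    ]
    for number, paragraphs in pages:
        out.append(f'<page n="{number}">')
        for paragraph in paragraphs:
            # escape() protects characters such as & and < inside the text
            out.append(f"<p>{escape(paragraph)}</p>")
        out.append("</page>")
    out += ["</body>", "</text>"]
    return "\n".join(out)


if __name__ == "__main__":
    # Invented sample document
    print(first_level_markup(
        "Sample title", "Sample author", 1998,
        [(1, ["First paragraph.", "Ideas & forms."]), (2, ["Second page."])],
    ))
```

Second-level information (quotations, direct speech, foreign words) cannot be produced this mechanically, which is consistent with the distinction between the two levels drawn above.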
At present, linguistic annotation is carried out after the
second level of encoding (it affects only one million words of
the corpus). In the future, linguistic annotation will be carried out after
the first level of encoding, so that linguistic information
will already be available at the second.
CORDE will also consist of 125 million words in its final version. These
words will cover the history of Spanish from its beginnings until 1975,
where CORDE links up with CREA. CORDE is a corpus of written texts which
includes, as CREA does, complete texts. It also uses SGML as its
markup language. The internal structure of the corpus takes into account
criteria of three types: chronological (the texts are grouped into three
periods, the Middle Ages, the Golden Age and the Contemporary Age, which
are subdivided into shorter units of time); geographical (the texts come
from all parts of the world where Spanish is spoken or has been spoken;
peninsular texts make up 74% of the corpus, as opposed to 26%
for Spanish texts from the rest of the world); and generic (the texts are
divided into two large groups, fiction (44%) and non-fiction
(56%), with further subdivisions).
Although CREA and CORDE are complementary, and the material of the former
will in time enlarge the latter, there are differences
between the two corpora. For instance, unlike CORDE, CREA does not include
any text in verse. This explains the different marking systems of the corpora:
that of CORDE is broader and more precise.
These corpora allow us to check the behaviour of linguistic structures
considered incorrect or anomalous. They also establish where these structures
appear and how strong they are in comparison with standard sequences. We
also learn which possibilities of the system have already been tried
and have been unsuccessful, perhaps for reasons external to
linguistic mechanisms. The questions of norm and style discussed in
section II are examined here from a different perspective. Humboldt believed
that language changes constantly; following this line of thought, the
search in a representative corpus lets us study the alternatives the
system offers to solve the same communicative problem, when they arise, and
where they have more vitality.
Let us see some examples of searches in CREA:
The sequence of a verb of movement followed by the prepositions a and por
(voy a por pan) is not normative in Spanish. The Real Academia
(Esbozo..., p. 436) notes that it began to spread in the popular
speech of Spain during the second half of the nineteenth century, and
that it developed in Spain but not in America. A CREA search is
illuminating: voy a por appears 72 times in 19 different documents, all
of them from Spain, and never in a text from an American country. The corpus
corroborates the Academy's observation.
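The kind of query just described (how many times a sequence occurs, in how many documents, and in which countries) can be sketched as follows. This is not the CREA interface: the record layout (id, country, year, text), the function name and the three toy documents are assumptions invented for the example, and the counts they produce are not corpus data.

```python
import re
from collections import defaultdict


def count_by(docs, phrase, field):
    """Occurrences and number of distinct documents per value of `field`.

    Returns {field_value: (occurrences, documents)}; values with no hit
    are simply absent from the result.
    """
    pattern = re.compile(r"\b" + re.escape(phrase) + r"\b", re.IGNORECASE)
    occurrences = defaultdict(int)
    documents = defaultdict(set)
    for doc in docs:
        hits = len(pattern.findall(doc["text"]))
        if hits:
            occurrences[doc[field]] += hits
            documents[doc[field]].add(doc["id"])
    return {key: (occurrences[key], len(documents[key])) for key in occurrences}


if __name__ == "__main__":
    # Toy documents written for this example
    toy_docs = [
        {"id": "d1", "country": "Spain", "year": 1995,
         "text": "Voy a por pan y luego voy a por agua."},
        {"id": "d2", "country": "Spain", "year": 1990,
         "text": "Mañana voy a por ti."},
        {"id": "d3", "country": "Mexico", "year": 1992,
         "text": "Voy por pan a la tienda."},
    ]
    print(count_by(toy_docs, "voy a por", "country"))  # {'Spain': (3, 2)}
```

Grouping by a different field (for instance the year or the thematic area, if the records carry one) answers the chronological and thematic questions raised in the examples that follow.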
The sequence of prepositions por contra is considered in current
Spanish a gallicism, i.e., a combination that imitates a similar French
sequence. It is advisable to avoid it and to use instead the correct expression
por el contrario. CREA allows us to extend the description of the
phenomenon with new data: this construction is much more frequent in Spain
(510 occurrences) than in America (10 occurrences). Moreover, it is especially
frequent in the thematic area of "journalism" (400 occurrences). The examples
in American texts are very recent (they have been documented only since 1994).
So the phenomenon has spread from French to European Spanish, mainly
through the language of the press. Recently this sequence has been found
in American Spanish (6 of the 10 cases in American texts appear in the
press; all of them occur in Venezuela and refer to Spanish sports
news, which may indicate that this evidence is doubtful).
The searches concerning lexical units are also conclusive. It has generally
been held that forms with the root implement- (-ar, -to, -ación,
the last one not recorded in the dictionary of the RAE, but found in recent
dictionaries of usage, for instance in the Gran Diccionario de la
Lengua Española of Larousse) have more vitality in the Spanish
of America than in that of Spain. The style manuals used by the Spanish
media usually censure this anglicism (e.g., Manual de español
urgente of the Agencia EFE). The data derived from CREA are revealing:
of the 1211 occurrences of the lexical family implement-, only 63
are found in texts from Spain, and only since 1982. This is the opposite
case to por contra, studied above. The form implementación
specifically appears in CREA 359 times, 23 of them in texts from
Spain (in only 14 different documents, and since 1987).
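The proportions behind these figures are easy to make explicit. The short calculation below uses only the counts quoted above; the one assumption is that the total for por contra is the sum of the two figures given (510 in Spain and 10 in America), since no overall total is stated. Because CREA is split evenly between Spain and America, raw counts from the two areas are roughly comparable.

```python
def share(part, total):
    """Percentage of `total` represented by `part`, rounded to one decimal."""
    return round(100 * part / total, 1)


if __name__ == "__main__":
    # implement- family: 63 of 1211 occurrences come from texts from Spain
    print(share(63, 1211))      # 5.2
    # implementación: 23 of 359 occurrences come from texts from Spain
    print(share(23, 359))       # 6.4
    # por contra: 10 American occurrences out of an assumed total of 510 + 10
    print(share(10, 510 + 10))  # 1.9
```

In other words, roughly one occurrence in twenty of implement- comes from Spain, while about one occurrence in fifty of por contra comes from America: the two cases mirror each other, as noted above.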
To sum up we may conclude as follows:
Agencia Efe (1991/8): Manual de español urgente, Madrid,
Cátedra.
Alcaraz, E. and M.A. Martínez (1997): Diccionario de lingüística
moderna, Barcelona, Ariel.
Alvar, M. (1982): La lengua como libertad, Madrid, Ed. Cultura
Hispánica.
Azaustre, A. and J. Casas (1997): Manual de retórica española,
Barcelona, Ariel.
Cervantes, M. (1998): Don Quijote de La Mancha, ed. dirigida
por F. Rico, Barcelona, Instituto Cervantes-Crítica.
Chomsky, N. (1988): El lenguaje y los problemas del conocimiento,
Madrid, Visor.
Enkvist, N. (1985): "Estilística, lingüística del
texto y composición", in Bernárdez, E. (comp.): Lingüística
del texto, Madrid, Arco Libros, 1987, pp. 131-150.
Enkvist, N. (1994): "The epistemic gap in the linguistic stylistics",
in Winter, W. (ed.): On Languages and Language, Berlin, Mouton
de Gruyter, pp. 109-126.
Gran Diccionario de la Lengua Española (1996), prólogo
de F. Rico, Barcelona, Larousse Planeta.
Jovellanos, G.M. (1978/3): Obras en prosa. Edición de José
Caso González, Madrid, Castalia.
Marina, J.A. (1993): Teoría de la inteligencia creadora,
Barcelona, Anagrama.
Real Academia Española (1992/21): Diccionario de la lengua
española, Madrid, Espasa Calpe.
Sager, J.C. (1994): Language Engineering and Translation: Consequences
of Automation, Amsterdam, John Benjamins.
Savater, F. (1997): El valor de educar, Barcelona, Ariel.
Send feedback to manager@ultibase.rmit.edu.au
Copyright © 2001 Faculty of Education Language and Community Services
Document URL: http://ultibase.rmit.edu.au/Articles/dec98/guerra1.htm
Last Updated: 08-December-1998 by Marita Mueller