Research Resources

Translation Resources

French Commercial-Legal-Diplomatic Translation: General
French Diplomatic Translation: Treaties
French Medical Translation
Spanish Commercial, Legal Translation
Spanish Medical Translation

Terminology and Language Technology Resources

The Handbook of Terminology Management
Standards for Content Creation and Globalization
The Information Cycle Feedback Loop
Knowledge Organization Sources and Systems

Corpus Resources

Corpora are electronic bodies of linguistic data (texts) that linguists extract (isolate from their larger texts) and concordance (align by keyword) to generate natural language samples for term, phrase or syntax modeling. Corpora can help translators empirically verify their intuitions about sense, connotation and near-synonymy, show patterns of actual frequencies or potential language use, reveal the lexical density of a text (particularly in translation research), identify semantic prosodies (connotations) and semantic preferences (the “clustering” of words around certain poles of meaning), and assist in overcoming imperfect overlap in collocational ranges across languages. Hatim and Munday (2005) map corpora in translation use as an interface with the language engineering discipline. Customized corpora may be generated with leasable software, while “found” corpora, some running to many millions of words, are available on web-based concordancing sites.

A Corpus Linguistics Glossary


Alias: A user-designated synonym for a Unix command or sequence of commands. For example, if you designated m to be your alias for mailx, then typing m will always run this mail program.

Alignment: The matching or linking of a text and its translation(s), usually paragraph by paragraph and/or sentence by sentence. Texts are often aligned in this way so that bilingual CONCORDANCES can be retrieved. Some alignment can be done automatically by software, although the best results are usually produced when a human user checks the automatic alignment and corrects it where necessary.

Alphanumeric: Of ASCII characters, any string composed only of upper- or lower-case English letters or Arabic numerals.

AMALGAM (Automatic Mapping Among Lexico-Grammatical Annotation Models)

Anaphora: Pronouns, noun phrases, etc. which refer back to something already mentioned in a text; sometimes the term is used more loosely (and, technically, incorrectly) to refer in general to items which co-refer, even when they do not occur in the text itself (exophora) or when they refer forwards rather than backwards in the text (cataphora).

Annotation: (1) The practice of adding explicit additional information to the machine-readable text; (2) The physical representation of such information.

ARCHER: a Representative Corpus of Historical English Registers

ASCII: The American Standard Code for Information Interchange is a standard character set that maps character codes 0 through 127 (low ASCII) onto control functions, punctuation marks, digits, upper case letters, and other symbols.

Attribute: In SGML, a quantifier within the opening tag for an element which specifies a value for some named property of that element.

Authenticity: a feature that characterizes naturally occurring corpus data

BFT (Binary File Transfer): A way of sending files by ftp. The files are sent in binary code, not translated into ASCII, which would risk some information loss.

CALL: computer-aided (or assisted) language learning

CAMET: Computer Archive of Modern English Texts, a project of Geoffrey Leech of the Department of Language and Modern English in 1970.

Character encoding: a system of using numeric values to represent characters

COCOA (word COunt and COncordance generation on Atlas): A method of text encoding used by the Oxford Concordance Program and other software.

Colligation: the collocation of a node word with a particular grammatical class of words

Collocation: the characteristic co-occurrence of patterns of words

Comparable corpus: a corpus which is composed of L1 data collected from different languages using the same sampling techniques

Comparative corpus: a corpus containing components of varieties of the same language

Concordance: an alphabetical index of a search pattern in a corpus, showing every contextual occurrence of the search pattern

Corpus balance: the range of different types of language that a corpus claims to cover

Corpus header: the part of a corpus that provides necessary bibliographical information, taxonomies used and other metadata relating to a corpus

Corpuses: a less commonly used plural form of corpus

Cross-tabulation: a table showing the frequencies for each variable across each sample

Co-text: a more precise term than context or verbal context, used to refer to the words on either side of a selected word or phrase

Dispersion: a term in descriptive statistics which refers to a quantifiable variation of measurements of differing members of a population within the scale on which they are measured

Ditto tag: in corpus annotation assigning the same part-of-speech code to each word in an idiomatic expression

DTD: Document Type Definitions in markup languages such as HTML, SGML and XML

Error-tagging: assigning codes indicating the types of errors occurring in a learner corpus

Factor analysis: a statistical analysis commonly used in the social and behavioural sciences to summarize the interrelationships among a large group of variables in a concise fashion

Fisher's exact test: an alternative to the chi-square or log-likelihood test that measures exact statistical significance level

Frequency: also called raw frequency, the actual count of a linguistic feature in a corpus

Interlanguage: the learner’s knowledge of the L2 which is independent of both the L1 and the actual L2

Keyword: words in a corpus whose frequency is unusually high (positive keywords) or low (negative keywords) in comparison with a reference corpus

KWIC: key-word-in-context concordance
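
A key-word-in-context display can be sketched in a few lines of Python. This is only an illustration: the function name kwic, the whitespace tokenization and the sample sentence are assumptions, and real concordancers treat punctuation, case and sorting far more carefully.

```python
def kwic(text, keyword, width=3):
    """Return (left, node, right) tuples for each occurrence of keyword.

    Tokenization is a plain whitespace split and matching is
    case-insensitive; both are simplifying assumptions.
    """
    tokens = text.split()
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, token, right))
    return lines


sample = "the cat sat on the mat and the dog sat on the rug"
for left, node, right in kwic(sample, "sat", width=3):
    # Right-align the left context so the node words line up in a column.
    print(f"{left:>20}  {node}  {right}")
```

Lining the node word up in a central column is what lets recurring patterns to its left and right be seen at a glance.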

Lemmatization: grouping together all of the different inflected forms of the same word

Lexicon: an inventory of word forms in a given language

Log-likelihood test: also known as an LL test, an alternative to the chi-square test
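
One commonly used way of applying the log-likelihood test to a single word in two corpora is sketched below. The function name and the example counts are hypothetical; the statistic is G2 = 2 * sum(O * ln(O / E)), where each expected value E assumes the word is equally frequent, relative to corpus size, in both corpora.

```python
import math


def log_likelihood(freq_a, size_a, freq_b, size_b):
    """G2 statistic for one word observed in two corpora."""
    total = freq_a + freq_b
    # Expected counts if the word had the same relative frequency in both.
    expected_a = size_a * total / (size_a + size_b)
    expected_b = size_b * total / (size_a + size_b)
    g2 = 0.0
    for observed, expected in ((freq_a, expected_a), (freq_b, expected_b)):
        if observed > 0:  # a zero count contributes nothing to the sum
            g2 += observed * math.log(observed / expected)
    return 2 * g2


# Hypothetical counts: 30 hits in one 1,000-word sample, 10 in another.
print(round(log_likelihood(30, 1000, 10, 1000), 2))
```

With one degree of freedom, a value above 3.84 corresponds to significance at the 0.05 level.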

Markup: a system of standard codes inserted into a document stored in electronic form to provide information about the text itself and govern formatting, printing or other processing

Mean: the arithmetic average, which can be calculated by adding all of the scores together and then dividing the sum by the number of scores

Merger: combination of two or more words (e.g. can’t and gonna)

Metadata: a term used to describe data about data, typically the contextual information of corpus samples

MI: mutual information, a statistical formula borrowed from information theory
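
As a rough sketch, the MI score for a node word and a collocate is often computed as the base-2 logarithm of the observed co-occurrence frequency divided by the frequency expected by chance. The function name and the figures below are hypothetical, and concordancing packages differ in details such as how the span enters the expected value.

```python
import math


def mutual_information(pair_freq, node_freq, collocate_freq, corpus_size):
    """Pointwise MI: log2 of observed over expected co-occurrence.

    pair_freq must be greater than zero, since log2(0) is undefined.
    """
    expected = node_freq * collocate_freq / corpus_size
    return math.log2(pair_freq / expected)


# Hypothetical counts: the pair occurs 32 times in a 1,000-token corpus
# in which the node occurs 100 times and the collocate 80 times.
print(mutual_information(32, 100, 80, 1000))
```

A score of zero means the pair co-occurs exactly as often as chance would predict; higher scores indicate a stronger association.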

Microconcord: a concordance package published by Oxford University Press

Monitor corpus: a corpus that is constantly supplemented with fresh material and keeps increasing in size

Normalization: a process which makes frequencies from samples of markedly different sizes comparable by bringing them to a common base
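
Normalization is simple arithmetic, as the following sketch suggests. The function name, the per-million base and the counts are assumptions chosen for illustration; any common base (per thousand, per ten thousand) works the same way.

```python
def normalized_frequency(raw_count, corpus_size, base=1_000_000):
    """Scale a raw count to occurrences per `base` words."""
    return raw_count * base / corpus_size


# Hypothetical counts: 50 hits in a 200,000-word corpus versus
# 150 hits in a 1,000,000-word corpus.
small = normalized_frequency(50, 200_000)
large = normalized_frequency(150, 1_000_000)
print(small, large)
```

Although the larger corpus has three times as many raw hits, the word is relatively more frequent in the smaller one once both are brought to the same base.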

Parallel corpus: a corpus which is composed of source texts and their translations in one or more different languages; also known as a translation corpus

Parsing: also called treebanking or bracketing, a process that analyzes the sentences in a corpus into their constituents

Population: the entire set of items from which samples can be drawn

POS: part-of-speech

Post-editing: human correction of automatically processed data

Range: the difference between the highest and lowest frequencies

Reference corpus: a balanced, representative corpus of general usage; in keyword analysis, a corpus that is used to provide a reference wordlist

Representativeness: a corpus is thought to be representative of the language variety it is supposed to represent if the findings based on its contents can be generalized to the said language variety

Recoverability: a term used to refer to the possibility for the user to recover the basic original text from any text which has been annotated with further information

RP: Received Pronunciation, the notional standard form of spoken British English

Sample: elements that are selected intentionally as a representation of the population being studied

Sample corpus: as opposed to a monitor corpus, a sample corpus is of finite size and consists of text segments selected to provide a static snapshot of language

Semantic prosody: the collocational meaning arising from the interaction between a given node word and its collocates

SEU: Survey of English Usage

Skeleton parsing: also called shallow parsing, a parsing technique that uses less fine-grained constituent types than would be present in a full parse

Sort: arrange concordances or a wordlist in a certain order

Span: a term used to refer to the measurement, in words, of the co-text of a word selected for study.
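
To illustrate how a span delimits the co-text that is examined, the sketch below counts every word found within a given number of tokens on either side of a node word. The function name, the default span of four and the sample phrase are assumptions, not a description of any particular package.

```python
from collections import Counter


def collocates(tokens, node, span=4):
    """Count words occurring within `span` tokens either side of node."""
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            # Tokens to the left and right of the node, excluding the node.
            window = tokens[max(0, i - span):i] + tokens[i + 1:i + 1 + span]
            counts.update(window)
    return counts


tokens = "strong tea and strong coffee but weak tea".split()
print(collocates(tokens, "tea", span=1))
```

Widening the span brings in more candidate collocates but also more incidental words, so the choice of span affects any collocation statistic computed from these counts.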

Specialized corpus: a corpus that is domain or genre specific and is designed to represent a sublanguage

SPSS: Statistical Package for the Social Sciences

Standardized type-token ratio: similar to type-token ratio, but computed every n (e.g. 1,000) words as the WordSmith Wordlist goes through each text file

Sub corpus: a component of a corpus, usually defined using certain criteria such as text types and domains

Tagging: an alternative term for annotation, especially word-level annotation such as POS tagging and semantic tagging

Tagset: a collection of tags in the form of a scheme for annotating corpora.

Text chunking: the practice of dividing sentences into non-overlapping segments on the basis of fairly superficial analysis

Token: an occurrence of any given word form

Tokenization: also called segmentation, a process that divides running text into legitimate word tokens, especially important for languages such as Chinese that do not delimit words with white spaces

Transcription: converting spoken data into a written form

Treebank: an alternative term for a parsed corpus

T-test: an alternative statistical test to the chi-square test

Type: a word form

Type-token ratio: the ratio between types and tokens, useful when comparing samples of roughly equal length
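
The type-token ratio and its standardized variant can be sketched as follows. The function names and the toy token list are hypothetical, and dropping a trailing partial chunk is an assumption made here for simplicity rather than a documented description of how WordSmith behaves.

```python
def type_token_ratio(tokens):
    """Number of distinct word forms divided by number of running words."""
    return len(set(tokens)) / len(tokens)


def standardized_ttr(tokens, chunk_size=1000):
    """Mean TTR over consecutive full chunks of chunk_size tokens."""
    ratios = [
        type_token_ratio(tokens[i:i + chunk_size])
        for i in range(0, len(tokens) - chunk_size + 1, chunk_size)
    ]
    if not ratios:
        raise ValueError("text is shorter than one chunk")
    return sum(ratios) / len(ratios)


tokens = "a b a b a b c d".split()
print(type_token_ratio(tokens))                 # whole-text ratio
print(standardized_ttr(tokens, chunk_size=4))   # mean of per-chunk ratios
```

Because longer texts repeat words more often, the plain ratio falls as text length grows; averaging over fixed-size chunks is what makes the standardized figure comparable across texts of different lengths.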

Unicode: a character encoding system designed to support the interchange, processing, and display of all of the written texts of the diverse languages of the world

Wildcard: a special character such as an asterisk (*) or a question mark (?) that can be used to represent one or more characters in pattern matching
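
Wildcard matching of this kind can be tried out with Python's standard fnmatch module, as in the sketch below. The function name and word list are hypothetical. Note that in fnmatch the asterisk matches zero or more characters and the question mark exactly one; individual concordancers may define their wildcards slightly differently.

```python
from fnmatch import fnmatchcase


def wildcard_search(words, pattern):
    """Return the words that match a shell-style wildcard pattern."""
    return [w for w in words if fnmatchcase(w, pattern)]


words = ["translate", "translator", "transfer", "woman", "women"]
print(wildcard_search(words, "translat*"))
print(wildcard_search(words, "wom?n"))
```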

Wordlist: a list of words occurring in a corpus, possibly with frequency information

WordSmith: a corpus exploration package with sophisticated statistical analysis, published by the Oxford University Press

Z-test: an alternative statistical test to the chi-square test


References:

Baker, Paul, Andrew Hardie & Tony McEnery. A Glossary of Corpus Linguistics. Edinburgh: Edinburgh University Press, 2006.

Olohan, Maeve. Introducing Corpora in Translation Studies. New York: Routledge, 2004.

Wang, Kefei. Research and Application of Bilingual Parallel Corpora. Beijing: Foreign Language Teaching and Research Press, 2004.


A Translator's Reading List of Corpus-related Works

Adab, B. (2001). The Translation of Advertising: A Framework for Evaluation. Babel, 47(2), 133-157.

Al-Khafaji, R. (2006). In Search of Translational Norms: The Case of Shifts in Lexical Repetition in Arabic-English Translations. Babel, 52(1), 39-65.

Alvarez Lugris, A. (2001). Tectra: Theory and Practice of Research Using Corpora in the Framework of Translation Studies. TRANS. Revista de Traductologia, 5, 185-204.

Alvarez, M. A. (1998). Fidelity to the Original in Literary Translation: Micro- and Macro-Analysis of Translational Phenomena. TRANS. Revista de Traductologia, 2, 67-81.

Alves, F. (2003). Translation, Cognition and Contextualization: Triangulating the Process-Product Interface in the Performance of Novice Translators. Revista de Documentação de Estudos em Linguistica Teorica e Aplicada (D.E.L.T.A.), 19(special), 71-108.

Alves, F., & Magalhaes, C. (2004). Using Small Corpora to Tap and Map the Process-Product Interface in Translation. TradTerm, 10, 179-211.

Araujo, V. L. S. (2004). To Be or Not to Be Natural: Cliches of Emotion in Screen Translation. Meta, 49(1), 161-171.

Baker, M. (1995). Corpora in Translation Studies: An Overview and Some Suggestions for Future Research. Target, 7(2), 223-243.
              (1998). Revisiting the Language of Translation: A Corpus-Based Approach. Meta, 43(4), 480-485.
              (1999). The Role of Corpora in Investigating the Linguistic Behaviour of Professional Translators. International Journal of Corpus Linguistics, 4(2), 281-298.
              (2004). A Corpus-Based View of Similarity and Difference in Translation. International Journal of Corpus Linguistics, 9(2), 167-193.
              (2004). The Treatment of Variation in Corpus-Based Translation Studies. Gothenburg Studies in English, 89, 7-17.
              

Baroni, M., & Bernardini, S. (2006). A New Approach to the Study of Translationese: Machine-Learning the Difference between Original and Translated Text. Literary and Linguistic Computing, 21(3), 259-274.

Bernardini, S. (2003). Designing a Corpus for Translation and Language Teaching: The CEXI Experience. TESOL Quarterly, 37(3), 528-537.
                  (2005). Reviving Old Ideas: Parallel and Comparable Analysis in Translation Studies-With an Example from Translation Stylistics. Gothenburg Studies in English, 90, 5-27.

Bosseaux, C. (2004). Point of View in Translation: A Corpus-Based Study of French Translations of Virginia Woolf's To the Lighthouse. Across Languages and Cultures, 5(1), 107-122.

Bowker, L. (1998). Corpus Exploitation Focused Terminological Research. Terminologies Nouvelles, 18(June), 22-27.
                (1998). Using Specialized Monolingual Native-Language Corpora as a Translation Resource: A Pilot Study. Meta, 43(4), 631-651.
                (1999). Exploring the Potential of Corpora for Raising Language Awareness in Student Translators. Language Awareness, 8(3-4), 160-173.
               (2000). A Corpus-Based Approach to Evaluating Student Translations. Translator, 6(2), 183-210.
               (2000). Towards a Methodology for Exploiting Specialized Target Language Corpora as Translation Resources. International Journal of Corpus Linguistics, 5(1), 17-52.
               (2001). Terminology and Gender Sensitivity: A Corpus-Based Study of the LSP of Infertility. Language in Society, 30(4), 589-610.
               (2001). Towards a Methodology for a Corpus-Based Approach to Translation Evaluation. Meta, 46(2), 345-364.
               (2002). An Empirical Investigation of the Terminology Profession in Canada in the 21st Century. Terminology, 8(2), 283-308.
               (2004). Corpus Resources for Translators: Academic Luxury or Professional Necessity? TradTerm, 10, 213-247.
Bowker, L., & Hawkins, S. (2006). Variation in the Organization of Medical Terms: Exploring Some Motivations for Term Choice. Terminology, 12(1), 79-110.

Brownlie, S. (2002). The Translation of Philosophical Terminology. Meta, 47(3), 295-310.
                 (2006). Narrative Theory and Retranslation Theory. Across Languages and Cultures, 7(2), 145-170.

Calzada Perez, M. (2001). A Three-Level Methodology for Descriptive-Explanatory Translation Studies. Target, 13(2), 203-239.

Cao, D. (2006). Research on Chinese-Japanese Parallel Corpus and Translation Study. Foreign Language Teaching and Research, 38(3), 221-226.

Colominas, C. (2004). The Use of New Technologies for Translation Instruction with Special Attention to Translational Corpora [The Possibilities Afforded by the BancTrad]. Lebende Sprachen, 49(2), 64-67.

Corpas Pastor, G. (2001). The Compilation of an ad hoc Corpus for Instruction in Specialized Translation into a Nonnative Language. TRANS. Revista de Traductologia, 5, 155-184.

Cosme, C. (2006). Clause Combining across Languages: A Corpus-Based Study of English-French Translation Shifts. Languages in Contrast, 6(1), 71-108.

Friedbichler, I., & Friedbichler, M. (1997). Corpus-Supported Translation beyond the Word Boundaries. On the Significance of Electronic Text Corpora and Concordance Programs for Translation. Lebende Sprachen, 42(2), 49-53.

Gonzalez Liano, I. (2001). Translation and Genre: The Feminism of Rosalia de Castro in English Translation. Viceversa, 7-8, 109-130.

Halverson, S. (1998). Translation Studies and Representative Corpora: Establishing Links between Translation Corpora, Theoretical/Descriptive Categories and a Conception of the Object of Study. Meta, 43(4), 494-514.

Kohn, J. (1999). The Computer-Assisted Study of Parallel Texts in Translation Education. Lebende Sprachen, 44(1), 6-14.

Kruger, A. (2004). Shakespeare in Afrikaans: A Corpus-Based Study of Involvement in Different Registers of Drama Translation. Language Matters, 35(1), 275-294.

Laviosa, S. (1997). How Comparable Can 'Comparable Corpora' Be? Target, 9(2), 289-319.
               (1998). Core Patterns of Lexical Use in a Comparable Corpus of English Narrative Prose. Meta, 43(4), 557-570.
               (1998). The Corpus-Based Approach: A New Paradigm in Translation Studies. Meta, 43(4), 474-479.
               (2004). Corpus-Based Translation Studies: Where Does It Come From? Where Is It Going? TradTerm, 10, 29-57.
               (2004). Corpus-Based Translation Studies: Where Does It Come From? Where Is It Going? Language Matters, 35(1), 6-27.
               (2004). When Italians Talk 'Business' They Mean It. TradTerm, 10, 279-293.

Lopez Arroyo, B., Fernandez Antolin, M. J., & de Felipe Boto, R. (2007). Contrasting the rhetoric of abstracts in medical discourse: Implications and applications for English/Spanish translation. Languages in Contrast, 7(1), 1-28.

Lores Sanz, R. (2003). The Translation of Tourist Literature: The Case of Connectors. Multilingua, 22(3), 291-308.

Machniewski, M. (2004). Contrastive Functional Analysis as a Starting Point for Corpus-Based Translation Studies (CTS): Methodological Considerations for Analysing Small Translational Corpora. Language Matters, 35(1), 102-118.

Mauranen, A. (2004). Contrasting Languages and Varieties with Translational Corpora. Languages in Contrast, 5(1), 73-92.

Monacelli, C. (2006). Implications of Translational Shifts in Interpreter-Mediated Texts. Pragmatics, 16(4), 457-473.

Montero-Martinez, S., & Garcia de Quesada, M. (2003). Terminological Analysis for Translation. Perspectives: Studies in Translatology, 11(4), 293-314.

Munday, J. (1998). A Computer-Assisted Approach to the Analysis of Translation Shifts. Meta, 43(4), 542-556.
                (2002). Translation Studies and Corpus Linguistics: An Interface for Interdisciplinary Co-Operation. Logos and Language: Journal of General Linguistics and Language Theory, 3(1), 11-20.

Olohan, M. (2002). Corpus Linguistics and Translation Studies: Interaction and Reaction. Linguistica Antverpiensia, 1, 419-429.

Ramon Garcia, N. (2002). Contrastive Linguistics and Translation Studies Interconnected: The Corpus-Based Approach. Linguistica Antverpiensia, 1, 393-406.

Saridakis, I. E., & Kostopoulou, G. (2007). Modern Trends in the Pedagogy of Specialised Translation: LSP, Text Typology and the Use of IT Tools. Linguistic Insights - Studies in Language and Communication, 47, 573-584.

Serban, A. (2004). Presuppositions in Literary Translation: A Corpus-Based Approach. Meta, 49(2), 327-342.

Sinclair, J., Payne, J., & Perez Hernandez, C. (1996). Corpus to Corpus: A Study of Translation Equivalence. International Journal of Lexicography, 9(3), 171-178.

Tsuji, K. (2002). Automatic Extraction of Translational Japanese-KATAKANA and English Word Pairs from Bilingual Corpora. International Journal of Computer Processing of Oriental Languages, 15(3), 261-279.

Utka, A. (2004). Phases of Translation Corpus: Compilation and Analysis. International Journal of Corpus Linguistics, 9(2), 195-224.

Vintar, S. (2001). Using Parallel Corpora for Translation-Oriented Term Extraction. Babel, 47(2), 121-132.

Waddington, C. (2006). Measuring the Effect of Errors on Translation Quality. Lebende Sprachen, 51(2), 67-71.

Wang, J.-H., Teng, J.-W., Lu, W.-H., & Chien, L.-F. (2006). Exploiting the Web as the Multilingual Corpus for Unknown Query Translation. Journal of the American Society for Information Science and Technology, 57(5), 660-670.

Wu, A., & Huang, L. (2006). On Corpus-Based Studies of Translation Universals. Foreign Language Teaching and Research, 38(5), 296-302.

Zanettin, Federico (2002) "DIY Corpora: The WWW and the Translator". In Maia, Belinda / Haller, Jonathan / Ulrych, Margherita (eds.) Training the Language Services Provider for the New Millennium, Porto: Faculdade de Letras, Universidade do Porto, 239-248.

           (2002) "Corpora for translation practice". In Elia Yuste-Rodrigo (ed.) Language Resources for Translation Work and Research, LREC 2002 Workshop Proceedings, Las Palmas de Gran Canaria, 10-14.

          (2002) "CEXI. Designing an English Italian Translational Corpus". In Kettemann, Bernhard / Marko, Georg (eds.) Teaching and Learning by Doing Corpus Analysis. Amsterdam: Rodopi, 329-343.

Journals:

International Journal of Corpus Linguistics (1996-)

(Published twice per year by John Benjamins. Papers on corpora for linguistics and human language technology.)

Corpora http://www.eupjournals.com/journal/cor

(A new journal of corpus linguistics focusing on the many and varied uses of corpora both in linguistics and beyond. The journal accepts articles presenting research findings based on the exploitation of corpora as well as accounts of corpus building, corpus tool construction and corpus annotation schemes.)