A Medical Multilingual Information Retrieval
Edson José Pacheco²,³, Percy Nohama²,³, Stefan Schulz¹, Kornél Markó¹
¹ Freiburg University Hospital, Department of Medical Informatics, Freiburg, Germany
² Paraná Catholic University, Health Informatics Laboratory, Curitiba, Brazil
³ CEFET-PR, Graduate Program in Electrical Engineering and Industrial Informatics, Curitiba, Brazil
Abstract. The Web is full of documents and resources. Users employ different strategies to find the information they need: browsing, using search engines, or following existing categories in a Web catalog. For technical sublanguages such as the medical one, document indexing based on lexical entities at a subword level has proved useful. However, it remains challenging to identify and delimit the meaningful lexical entities, as well as to group them into synonymy classes. We present the lexicographic and semantic foundation underlying the multilingual MORPHOSAURUS lexicon.

Resumo. A Web é repleta de documentos e outros recursos. Usuários utilizam diferentes estratégias para encontrar as informações que eles necessitam: navegando entre sites, usando máquinas de busca ou usando catálogos de domínio. Para linguagens técnicas como a linguagem médica, a indexação de documentos usando entidades lexicais em nível estruturante menor que a palavra tem se mostrado útil. Porém, ainda há desafios com relação à identificação e delimitação de entidades lexicais apropriadas, assim como ao agrupamento em conjuntos de sinônimos. Apresentamos os fundamentos lexicográficos e semânticos do léxico multilíngue MORPHOSAURUS.
The problem of Information Retrieval is that users have to spend a lot of time and effort navigating and may still not find the information required; sometimes the information is lost during the document navigation process. In the medical context, the problem is exacerbated because the information sources are so numerous. Various works address medical information retrieval, normally supported by fully or semi-automated indexing of documents. However, these works aim only at the acquisition of information and not at the evaluation of the data. In this work, we address the efficient retrieval of medical information in many formats (medical documents, guidelines, knowledge representations, etc.), extracting knowledge that guides users when they browse and search for information, and representing the meaningful elements of the textual sources by means of controlled vocabularies, thesauri and ontologies, based upon a rich inventory of biomedical terminologies such as those provided by the UMLS, OBO and others. To do so we need a multilingual thesaurus, in our case a medical domain thesaurus, to process documents written in one language and to build language-independent document representations (using an artificial language). Specifying an artificial language requires deep knowledge of the syntactic and semantic structure of human languages.

The conventional view of human language is word-centered, at least for written language, where words are clearly delimited by spaces. It builds on the hypothesis that words are the basic building blocks of phrases and sentences. In syntactic theories, words constitute the terminal symbols. Breaking natural language down to the word level therefore appears straightforward. When we look at the sense of natural language expressions, however, we find much evidence that semantic atomicity frequently does not coincide with the word level. As an example, in the English term high blood pressure the word limits reflect the semantic composition quite well, whereas this is not the case in its literal translations verhoogde bloeddruk (Dutch) or bluthochdruck (German). Especially in technical sublanguages we encounter atomic senses at different levels of fragmentation or granularity. An atomic sense may correspond to word stems (e.g., hepat-), prefixes (e.g., anti-), suffixes (e.g., -logy), larger word fragments (hypophys-), words (spleen) or even combinations of words (yellow fever). The possible combinations of these word-forming elements are immense, and ad-hoc term formation is common. As a consequence, high coverage by a domain-specific lexicon can only be expected if lexical units are restricted to units of atomic sense, which can then be used as building blocks for composed terms at any level of granularity.

Extracting atomic sense units from texts in order to achieve a basis for cross-language semantic document indexing is an important goal for many applications in the fields of information extraction, text mining and document retrieval [Schulz and Hahn 2000]. The latter is the main application context of the MORPHOSAURUS system¹, which builds upon a multilingual lexicon of semantically atomic lexical units covering the domain of medicine. In the following we will give a semi-formal account of lexical atomicity as the theoretical basis of the MORPHOSAURUS subword lexicon. We will then turn to an empirically founded scheme for the delimitation of words and lexical items at a subword level. Our application domain is medicine; we use examples in English, Spanish, and Portuguese. Finally, we will present the
MORPHOSAURUS lexicon and its lexicographic guidelines as a concrete instance of the
We here introduce the notion of “semantic atomicity”, which will guide our further argumentation in this article. A sequence of characters is semantically atomic if the sense conveyed² (in a given language and a given domain context) is not univocally derivable from the sense of its constituents. In linguistic terms, the constituents of words are morphemes, and they are tied together by word-forming operations such as inflection, derivation and composition. Inflection conveys number, gender, tense, or aspect information, thus combining the lexical sense of the word stem with the grammatical function of the affix. Derivation, in contrast, covers different phenomena. A derivational affix may simply change the part of speech without any semantic implication (patient with a severe injur-y = severe-ly injur-ed patient). Or it may add an additional sense, such as hepatitis = hepat (liver) + itis³ (inflammation). However, cases in which the derived form has gained a sense of its own are frequent. For instance, neurosis is the result of linking neur (nerve) with osis (disease). Yet the sense of neurosis is not really a disease of nerves (at least in modern scientific medicine). As a consequence, the derivation neurosis would be considered an atomic lexical unit. (Single-word) composition, finally, combines two or more stems in one and the same word. It is a very frequent phenomenon in Germanic languages, but also in technical sublanguages where words like adenosintriphosphat, prebetalipoproteinemia, osteoartrose, inmunodeficiencia, referred to as “neoclassical compounds” [McCray et al. 1988], are common.

¹ http://www.morphosaurus.net
² We understand by the sense of a linguistic expression the mental construction associated with this expression, in contrast to the words’ referents (concrete objects in the world) [Eco et al. 1988].
³ cf. discussion of related work on domain-specific suffixes in [Schulz and Hahn 2000]
Lexical units may have multiple senses (homonymy, in a broad sense), and one sense can be expressed by different expressions (synonymy). Although domain-specific terminologies are constructed in order to control the use of a specialized language and to avoid ambiguous expressions, non-standardized terminology is widely used in any domain. For instance, molar has a completely different sense in obstetrics (molar pregnancy) than in lab medicine (molar mass) or in dentistry (fractured molar). Head has a different sense in headache than in head of femur or head of department. Operation means “surgical procedure” in a medical domain, as opposed to different senses in mathematics or business. In such cases, the local context (the surrounding words) generally helps us select the right sense. Furthermore, the restriction to a well-defined domain (e.g. clinical medicine, in our case) allows us to ignore word senses which are definitely outside that domain (e.g. the sense of head as the role of a word in grammar theory).
Besides ambiguity, lexical units may have overlapping senses. Quasi-synonymy relations can hold between terms of different languages (caput, head) or different levels of erudition (belly, abdomen). Complete identity in sense (true synonymy) which holds throughout all possible uses of a word is rare. If we want to establish classes of synonymous expressions we have to, firstly, make a clear commitment to the environment in which the expressions are considered synonymous, viz. what we call the domain context, and secondly, agree upon a tolerance in sense deviation which is still compatible with the formal properties of an equivalence relation⁴: If we agree on considering disease a synonym of illness and illness a synonym of sickness, then disease and sickness are synonyms as well. The tolerance also depends on the relevance of subtle sense distinctions in the chosen domain context. In the domain of clinical medicine, e.g., neoplasm-, cancer, carcinom- would hardly be considered synonyms, but a different decision may be taken in another domain. A counterexample would be to create an equivalence class {excis-, exstirp-, remoc-, -ectom-} in a domain of general medicine, neglecting subtle distinctions of surgical technique. Translation is a special case of synonymy in which words of different languages are linked. Here we can define equivalence classes as well, e.g. {disease, illness, enfermedad, doença}. Not only the grouping of lexical units into synonymy classes, but also their proper delimitation depends on the domain context. Leukemia, e.g., literally means “white blood”, and neurosis literally means “nerve disease”. This may be plausible in a historic medical context, but it provides an incomplete description when related to modern medicine. Thus, a composite sense may be ascribed in the historic context, and an atomic one in the present one.
In order to represent atomic senses of lexical units we define a semantic layer which contains language-independent identifiers, so-called MIDs (MorphoSaurus IDs). MIDs can be roughly compared to concepts in thesauri (such as CUIs in the UMLS Metathesaurus [UMLS 2004]) or to synsets in WordNet [Fellbaum 1998]⁵. However, there are two major differences between MIDs and UMLS CUIs or WordNet synsets: Firstly, MIDs can represent disjunctions of different senses. This is the case when ambiguous lexical units are addressed. To take the above example, the disjunction of the different senses of molar is represented by one MID, and each of the non-ambiguous senses by another MID. Secondly, all lexical units which are assigned to one MID must be fully interchangeable. For example, {head, caput, cabec, cabez, cefal, cephal} would not be a proper representation of one MID, since head has additional senses, at least in a domain context which includes the meaning of head as “person in charge of sth.”.

⁴ reflexivity, transitivity, symmetry
⁵ In terms of notation, a MID will be represented by the composition of the # sign with one of its non-ambiguous English lexemes, e.g. #liver = {hepar, hepat, liver, figad, higad} or #caput = {caput, cabec, …}
A different view on MIDs is to regard them as non-ambiguous words of an interlingua, since each synonym class is uniquely identified by one MID. This perspective emphasizes our preference for representing lexical meaning in abstraction from the variety of human language, an exercise that must not be mistaken for the construction of a domain ontology (cf. [Hirst 2004]).
We now introduce the notion of a subword as the minimal meaning-bearing constituent of a domain-specific term. Its defining property is that its sense is not composite. This rules out, for instance, considering hepatitis a valid subword, because its sense can be derived from its constituents. In contrast, composing the senses of hypo and physis does not lead to the proper sense of hypophysis, i.e. hypophysis is semantically underdetermined and therefore qualifies as a subword. For each subword there exists at most one MID, where the assignment of the MID depends on the domain context d and the language i under consideration. If no meaning is assigned to a subword, it is a stop entry (it has only a grammatical function), such as auxiliary verbs or inflection endings. The relation between lexical unit, sense, domain context⁶ and language can therefore be expressed by the quadruple (LU, MID, D, L). Let us now consider some typical examples:
• (l1, m, d, i), (l2, m, d, i), (l3, m, d, i)
l1, l2 and l3 are synonyms in domain d and language i since they refer to the same MID m. Example: nephr-, ren-, kidney.
• (l1, m, d, i1), (l2, m, d, i2)
l1 in language i1 is the translation of l2 in language i2 in domain d, which is expressed by the reference to the same MID m. Example: nephr-, riñon.
• (l, m1, d, i), (l, m2, d, i)
l has the two senses m1 and m2 in domain d and language i. Example: head refers both to body parts and to persons who are in charge of something.
• (l, , d, i1), (l, m2, d, i2)
l is a stop entry in language i1 and has the sense m2 in language i2. Example: era is an auxiliary verb form in Spanish and Portuguese and a noun in English.
• (l1, m1, d1, i1), (l2, m1, d1, i1), (l1, m2, d2, i1), (l2, m3, d2, i1)
l1 and l2 are synonyms in language i1 and domain d1 but not in domain d2. Example: sildenafil and viagra can be considered synonyms in clinical medicine but not in the context of the pharmaceutical industry.
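The quadruple cases above can be made concrete with a small data structure. The following Python sketch is only an illustration and not the MORPHOSAURUS implementation: the class Entry, the helper functions, the domain label "clinical" and the MID names #head_bodypart, #head_person and #era are assumptions introduced here. A stop entry is modelled by a missing MID (None).

```python
from typing import NamedTuple, Optional


class Entry(NamedTuple):
    lu: str             # lexical unit (subword)
    mid: Optional[str]  # MID, or None for a stop entry
    domain: str         # domain context
    lang: str           # language code


# Illustrative quadruples; MID names other than "#kidney" are hypothetical.
ENTRIES = [
    Entry("nephr", "#kidney", "clinical", "en"),
    Entry("ren", "#kidney", "clinical", "en"),
    Entry("kidney", "#kidney", "clinical", "en"),
    Entry("riñon", "#kidney", "clinical", "sp"),
    Entry("head", "#head_bodypart", "clinical", "en"),
    Entry("head", "#head_person", "clinical", "en"),
    Entry("era", None, "clinical", "sp"),
    Entry("era", "#era", "clinical", "en"),
]


def senses(entries, lu, domain, lang):
    """All MIDs assigned to a lexical unit in one domain and language."""
    return {e.mid for e in entries
            if (e.lu, e.domain, e.lang) == (lu, domain, lang)
            and e.mid is not None}


def synonyms(entries, lu, domain, lang):
    """Other lexical units of the same language and domain sharing a MID."""
    mids = senses(entries, lu, domain, lang)
    return {e.lu for e in entries
            if e.mid in mids and e.domain == domain
            and e.lang == lang and e.lu != lu}


def translations(entries, lu, domain, source_lang, target_lang):
    """Lexical units of the target language sharing a MID with lu."""
    mids = senses(entries, lu, domain, source_lang)
    return {e.lu for e in entries
            if e.mid in mids and e.domain == domain
            and e.lang == target_lang}


def is_stop_entry(entries, lu, domain, lang):
    """True if lu is listed for this language but carries no MID."""
    hits = [e for e in entries
            if (e.lu, e.domain, e.lang) == (lu, domain, lang)]
    return bool(hits) and all(e.mid is None for e in hits)
```

With these entries, synonyms(ENTRIES, "nephr", "clinical", "en") yields ren and kidney, while senses(ENTRIES, "head", "clinical", "en") returns two MIDs, which corresponds to the ambiguity case.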
MIDs can be linked by two lexical relations, viz. the horizontal (syntagmatic) relation expands-to, and the vertical (paradigmatic) relation has-sense:

• The relation expands-to(m0, [m1, m2, ..., mn]) relates a MID m0 to an ordered list of MIDs (at least 2 elements). This relation is used in order to make a hidden semantic compositionality explicit. Example: The MID assigned to the lexical item short is expanded to the sequence of the MID representing {length, longitud, comprimento} and the MID representing the meaning of “low value”. The relation expands-to is also used to deal with composed meanings in compounds which exhibit omission of characters, e.g. urinalysis (see below).
• The relation has-sense(m0, {m1, m2, ..., mn}) relates an ambiguous MID to a set of MIDs (at least 2 elements). This relation is used to relate an ambiguous MID to each of its (non-ambiguous) senses. Example: The MID assigned to the ambiguous word head is related via has-sense to the non-ambiguous MIDs for “upper part of the body” and “person in charge of sth.”.

⁶ We will not need an elaborated theory of domain contexts for the following examples. For a detailed discussion cf. [Buvač et al. 1994].
Both relations are transitive. Insertions into lists or sets create expanded lists or sets, not nested ones, e.g.:
• expands-to(m0, [m1, m2]) & expands-to(m1, [m3, m4]) is equivalent to
expands-to(m0, [m3, m4, m2])
• has-sense(m0, {m1, m2}) & has-sense(m1, {m3, m4}) is equivalent to
has-sense(m0, {m3, m4, m2})
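The substitution behaviour of both relations can be sketched in a few lines of Python. This is a minimal illustration under our own assumptions: the relations are stored as plain dictionaries, and the function names are hypothetical. Targets are substituted recursively until only MIDs without further expansion remain, and a cyclic definition raises an error.

```python
def _substitute(mid, relation, path=()):
    """Recursively replace a MID by its targets, flattening the result."""
    if mid in path:
        raise ValueError(f"cycle detected at {mid}")
    if mid not in relation:
        return [mid]  # nothing to substitute
    flat = []
    for target in relation[mid]:
        flat.extend(_substitute(target, relation, path + (mid,)))
    return flat


def normalize_expands_to(expands_to):
    """expands_to maps a MID to an ordered list of MIDs; order is kept."""
    return {mid: _substitute(mid, expands_to) for mid in expands_to}


def normalize_has_sense(has_sense):
    """has_sense maps an ambiguous MID to a set of MIDs."""
    return {mid: set(_substitute(mid, has_sense)) for mid in has_sense}


expanded = normalize_expands_to({"m0": ["m1", "m2"], "m1": ["m3", "m4"]})
resolved = normalize_has_sense({"m0": {"m1", "m2"}, "m1": {"m3", "m4"}})
```

Here expanded["m0"] is the flat list [m3, m4, m2] and resolved["m0"] is the flat set {m3, m4, m2}, matching the two equivalences stated above.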
Cycles are not allowed. A set of inter-MID relations is called normalized if all possible substitutions are realized. A set of quadruples, together with a set of inter-MID relations, defines a multi-context multilingual dictionary D. Unlike many thesauri such as the UMLS [UMLS 2004] or WordNet [Fellbaum 1998], we do not define semantic relations between equivalence classes such as hypernymy, hyponymy, meronymy etc. Encoding these richer relations is left to domain thesauri or ontologies such as MeSH [MESH 2004] or SNOMED CT [SNOMED 2004]. MIDs can be linked to external vocabularies or ontologies by the following triple: (MID, ONT, EID+). ONT is the identifier of the external source, EID is the identifier of the term / class / concept of the external source (conjunctions of identifiers are possible). If the MID is ambiguous with regard to the external vocabulary, there will be one record for each EID+ per MID.
In the following, we describe a concrete implementation of the lexicon model introduced above, viz. the structure of the MORPHOSAURUS lexicon, a multilingual subword repository covering the domain of clinical medicine. The MORPHOSAURUS lexicon provides the database for the MORPHOSAURUS indexer, a tool which extracts meaningful items from texts and maps them to MIDs, resulting in a language-independent abstraction of text contents. The MORPHOSAURUS lexicon, so far, does not manage multiple contexts. Rather, it is committed to one well-defined domain context, viz. clinical medicine. We introduce further specifications and conventions which characterize the MORPHOSAURUS lexicon and from which guidelines for lexicon construction and management can be derived. This lexicon is mainly a lexicon of subwords, as introduced above, but it contains, for reasons explained in the following, a limited number of multi-word entries. We therefore use the broader term “lexical unit” in the following, rather than “subword”.
3.1. Attributes of lexicon entries
Every lexical unit is classified according to the following attributes:

• Language: English (en), Spanish (sp), German (ge), Portuguese (pt), French (fr), Swedish (sw), ... The language attribute refers to the real-world occurrence of lexemes, including common foreign words. This means that English lexemes which commonly occur as foreign lexemes in a certain domain (e.g. shunt, round, feedback) are considered lexemes of the respective host language.
• Lexical units are word stems, prefixes, suffixes, infixes, proper prefixes, proper suffixes, or invariants:
Stems (ST), like gastr, hepat, enferm, diaphys, head, are the primary content carriers in a word. They can be prefixed, linked by infixes, and suffixed; some of them may also occur without affixes.
Prefixes (PF), like de-, re-, in-, an-, precede a stem once or more⁷.
Proper Prefixes (PP), like peri-, hemi-, down-, are prefixes that themselves cannot be prefixed.
Infixes (IF), like -o- in gastr-o-intestinal or -r- in hernio-r-rafia, are used as a (phonologically motivated) glue between stems.
Suffixes (SF), such as -a, -io, -ion, -tomy, -itis⁸, follow a stem or another suffix.
Proper Suffixes (PS) (e.g. verb endings such as -ing, -ieron, -ão, -iésemos) are suffixes that cannot be suffixed.
All these lexeme types are used for the segmentation of inflected, derived and composed words, taking into account their compositional constraints. In contrast, Invariants (IV), like ion, gene, coincide with words and are not allowed as word parts. In most cases, these are short words which would cause artificial ambiguities if they could be used as building blocks for complex words.
We use the following notation for lexical items: the languages are added as superscripts and the lexeme type as subscript, e.g. ectom[en,sp,pt] means that the string “ectom” acts as a suffix in English, Spanish, and Portuguese. A MID represents the sense of a group of lexemes which are considered synonymous in the given domain context, e.g. #remove = {ectom[en,sp,pt], exstirp[en,pt], estirp[sp], remov[en,sp,pt], ...}. Meaningless lexemes (stop entries), e.g. grammatical suffixes like -ation, -s, -ed, -ación, and auxiliary and modal verb forms like is, have, would, tuvieron, es, era, soy, são, are not assigned to a MID, since they are ignored for indexing.
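How such entries could drive indexing is sketched below in Python. This is an illustration under stated assumptions, not the MORPHOSAURUS indexer: the function names are hypothetical, the input is assumed to be already segmented into subwords, and the language codes attached to the stop entries are our own reading of the examples. Stop entries map to no MID and are therefore dropped from the index.

```python
def build_lexicon(classes, stop_entries):
    """Map (lexeme, language) pairs to a MID, or to None for stop entries."""
    lexicon = {}
    for mid, members in classes.items():
        for lexeme, langs in members:
            for lang in langs:
                lexicon[(lexeme, lang)] = mid
    for lexeme, langs in stop_entries:
        for lang in langs:
            lexicon[(lexeme, lang)] = None
    return lexicon


def index_subwords(subwords, lang, lexicon):
    """Replace subwords by MIDs; stop entries and unknown items are skipped."""
    mids = []
    for subword in subwords:
        mid = lexicon.get((subword, lang))
        if mid is not None:
            mids.append(mid)
    return mids


LEXICON = build_lexicon(
    classes={
        "#remove": [
            ("ectom", ["en", "sp", "pt"]),
            ("exstirp", ["en", "pt"]),
            ("estirp", ["sp"]),
            ("remov", ["en", "sp", "pt"]),
        ],
    },
    stop_entries=[("ation", ["en"]), ("s", ["en"]), ("ed", ["en"]),
                  ("ación", ["sp"])],
)

english = index_subwords(["remov", "ed"], "en", LEXICON)
spanish = index_subwords(["estirp", "ación"], "sp", LEXICON)
```

Both the English and the Spanish input reduce to the single identifier #remove, which is the language-independent representation the lexicon is meant to provide.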
3.2. Equivalence Class Relations
As introduced above, we link MIDs by two semantic relations, viz. has-sense and expands-to. Groups of lexemes which have (the same) multiple senses are assigned a MID of their own. The has-sense relation then connects such ambiguous MIDs to each of their senses. Example: #lobo = {lobo[sp,pt], lobos[sp,pt]} is linked by has-sense to both #wolf = {wolf[en], wolves[en], ...} and #lobe = {lob[en], ...}. #cold = {cold[en]} is linked to both #lowtemp = {frio[sp,pt], fria[sp,pt], ...} and #commoncold = {’common cold’[en], ...}.
The expands-to relation links one or more non-atomic lexemes (which are also grouped by a MID) to their atomic senses. There are mainly four reasons for this:

1. Very short morphemes are not permitted as word constituents in order to prevent improper segmentation of compounds. Words which contain these morphemes must therefore have their semantic decomposition pre-coded. For example, #myalg = {myalg[en], mialg[sp,pt]} is linked by expands-to to the sequence of #muscle = {muscul[en,sp,pt], muscle[en], ...} and #pain = {algy[en], algia[sp,pt], pain[en], ...}, thus avoiding the occurrence of my or mi in the lexicon;
2. An indecomposable lexeme in one language has a composed sense in the reference language⁹. For example, #esparadrapo = {esparadrap[sp,pt]} is linked by expands-to to the sequence of #adhesive = {adhesiv[en,sp,pt], ...} and #tape = {tape[en], ...};
3. Compounds exhibit ellipsis (omission of characters). For example, #urinalise = {urinalise[pt]} is linked by expands-to to the sequence of #urine = {urin[en,sp,pt], ...} and #analysis = {analys[en], analis[sp,pt], ...};
4. Words are nondecomposable but have an inherent composite semantic structure, e.g. #broad = {broad[en], larg[sp,pt]} is linked by expands-to to the sequence of #breadth = {largur[sp,pt], breadth[en], ...} and #highgrade.

⁷ E.g. in hemi-an-opsia the prefix an is prefixed by hemi.
⁸ The classification of subwords like -logia or -itis as suffixes may be controversial. For the applications supported by the MORPHOSAURUS lexicon, this is, however, of minor relevance. As a rule of thumb, our criterion for stems is that they do not require any other stem in order to build well-formed words.
A comprehensive list of standard and domain-specific affixes is the starting point of subword dictionary building. Sources for affixes and infixes are the morphological grammar specifications for the respective languages.¹⁰ As a consequence, the main criterion for the delimitation of a word stem is its compatibility with existing prefixes and suffixes, e.g. in+compat+ibility, aprend-izaje, ventricul-i. Wherever derivation causes a clear change of word sense which goes beyond the combined sense of the components, the derivative gains the status of a new lexeme with a different MID, e.g. decubit- in addition to cubit-, neurot- in addition to neur-. Many words of Latin and Greek origin come with stem variants (e.g., corpus, corpor+is; abdomen, abdomin+al; diagnos-is, diagnost-ico). Here, a reduction to the common substring (corp- or abdom-) would cause the proliferation of pseudo-suffixes (here -oris, -inal) on the one hand and the generation of short word stems on the other hand. In these cases, stem variants are accounted for in the lexicon.
A high-performance extraction of subwords from large amounts of text is best achieved by the use of finite-state techniques for lexicon-based decomposition, dederivation and deflection such as described in [Schulz and Hahn 2000]. Lexicon builders’ decisions on subword delimitation are therefore driven not only by formal linguistic criteria, but also by the proper functioning of segmentation. This is especially relevant for long and composed words where different valid segmentations are possible. For example, nephrotomy may be segmented into nephr[en] (#kidney) + o[en,sp,pt] + tomy[en] (#incision), but also into nephr[en] + oto[en] (#ear) + my[en] (#muscle). If the word segmentation routine prefers a long match from the left, the second (erroneous) segmentation would be preferred. Only costly knowledge and language processing routines (which are not available, in general) could be expected to detect this kind of error. A pragmatic solution is to include additional synonymous lexeme variants. In our example, this means that the sense #kidney is not only represented by nephr[en] but also by nephro[en] (as well as by nefr[sp,pt] and nefro[sp,pt]).

⁹ The reference language is English. Expansions into other languages are therefore not allowed, since the intelligibility of the semantic structure of this dictionary would be restricted to the speakers of that language.
¹⁰ Common agglutination of suffixes may be pre-coded (e.g., -igkeiten, -izations, -ivelmente, -ectomies, -ingness, -ationally).
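The nephrotomy example can be reproduced with a small segmentation routine. The Python sketch below is a simplification chosen for illustration: it uses a recursive longest-match-first strategy with backtracking instead of the finite-state techniques cited above, it ignores lexeme types and their compositional constraints, and the function name segment is hypothetical.

```python
def segment(word, lexicon):
    """Split word into lexicon entries, trying the longest left match first.

    Returns a list of subwords, or None if no complete segmentation exists.
    """
    if not word:
        return []
    for end in range(len(word), 0, -1):
        piece = word[:end]
        if piece in lexicon:
            rest = segment(word[end:], lexicon)
            if rest is not None:
                return [piece] + rest
    return None


# Lexemes and MIDs from the example; the infix "o" carries no MID.
LEXICON = {
    "nephr": "#kidney",
    "o": None,
    "tomy": "#incision",
    "oto": "#ear",
    "my": "#muscle",
}

erroneous = segment("nephrotomy", LEXICON)

# Adding the synonymous stem variant steers the longest match.
FIXED_LEXICON = dict(LEXICON, nephro="#kidney")
corrected = segment("nephrotomy", FIXED_LEXICON)
```

With the original entries the routine returns nephr + oto + my, because oto is the longest match after nephr. Once the variant nephro is present, it is matched first and the remainder is segmented as tomy.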
3.4. Corpus-based validation of string specificity
Especially short or ambiguous word stems, such as gen, my, mi, ship, are prone to side effects as described above. The shorter they are, the more frequently they occur as accidental substrings, producing erroneous segmentation results. In order to assess this risk empirically, we match them against word lists built from domain-specific text corpora. Here we distinguish between two cases:
• The number of accidental matches is high: First, all correct matches have to be checked. Here, in many cases, the short stem will occur at the beginning of a word. If this does not lead to false matches, we can (unorthodoxly) add this stem as a proper prefix in order to make use of the position constraint on this class of lexemes. If there are still many occurrences inside words left, then the pertaining compounds or prefix-stem combinations have to be added to the lexicon and linked to their components by expansion. An example of this is the stem ship. We must avoid that the sense of ship (vessel, to send) is extracted from any word with the suffix -ship, e.g. relationship. Therefore ship is added both as an invariant and as a prefix (!) instead of a stem, together with its usual inflections. For each excluded short stem, the most frequent compounds and derivatives have to be included, together with their inflections (e.g. #eat = {eat[en], eats[en], eating[en], ate[en], eaten[en], eater[en]}). In order not to preclude synonymy matches, e.g., #egg = {ov[en,sp,pt], ovo[en,sp,pt], huev[sp], egg[en]}, the syntagmatic expansion link can be used, e.g. #oocyte = {oocyte[en], oocit[sp,pt]} is linked by expands-to to #egg
• There are relatively few accidental matches: Here the strategy is the opposite one. The stem is added to the lexicon, and the erroneously matching words are segmented. Wherever the erroneous stem happens to be extracted, adjustments have to be made to the components of these words. An example of this is the nephrotomy case. Instead of eliminating oto as a stem, the stem variant nephro is added (see above) and thus false segmentation results are avoided.
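The corpus check that precedes this decision can be approximated by a simple position profile. The following Python sketch is our own illustration: the function name match_profile and the tiny word list are hypothetical, and the code only reports where a stem occurs. Judging which of the reported matches are accidental remains a manual step.

```python
def match_profile(stem, words):
    """Group the words containing stem by the position of its first match."""
    profile = {"whole": [], "start": [], "inside": []}
    for word in words:
        position = word.find(stem)
        if position == -1:
            continue  # stem does not occur in this word
        if word == stem:
            profile["whole"].append(word)
        elif position == 0:
            profile["start"].append(word)
        else:
            profile["inside"].append(word)
    return profile


# Hypothetical word list standing in for a corpus-derived one.
WORDS = ["ship", "shipment", "relationship", "membership", "kidney"]
profile = match_profile("ship", WORDS)
```

For this list, ship is found once as a whole word, once at the start (shipment) and twice inside a word (relationship, membership). Many matches inside words would point to the first case above, few to the second.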
3.5. Criteria for Inclusion of Subwords in the Dictionary
The selection of lexical units should reflect the language use in the domain of interest. Again, we use word statistics extracted from extensive, language-specific corpora in order to measure the relevance of terms. Ideally, each lexicon entry should correspond to an atomic (indivisible) entity of semantic reference. However, there are borderline cases, especially where a composed lexeme has an atomic synonym. In such cases, either the composed lexeme is entered as a whole and equated with its atomic synonym (case 1 below), or the atomic lexeme is related to the components of its synonym by the relation expands-to (case 2 below). Example:
1. #ascorb = {ascorb[en,sp,pt], vitamin c[en], vitamina c[sp,pt]}.
2. #ascorb is expanded to the sequence of #vitamin = {vitamin[en,sp,pt] } and #C =
The latter case is preferred if the components of the composed lexeme are semantically relevant, the former if the components are semantically weak.
In contrast to the general rule, semantically underdetermined complex lexemes or noun groups need not be included in the dictionary as long as there exists a strict mapping through all languages of interest. As an example, the sense of the term yellow fever is not derivable from its components, but its components translate literally into all languages (fiebre amarilla, febre amarela, gelbfieber).
Proper names are entered into the lexicon under the following circumstances: (i) they are needed for synonym linkage, e.g. between different product names, as in #diclofenac = {diclofenac[en,sp,pt], voltaren[en,sp,pt], cataflam[en,sp,pt]}; (ii) they are used as eponyms, i.e. they belong to the domain terminology (e.g. crohn, parkinson); (iii) translations exist, especially with regard to geographic terms (suiza = switzerland).
3.6. Aspects of lexicon construction
Finally, we outline the process of lexicon construction as it underlies the MORPHOSAURUS lexicon. It is based upon the view that the delimitation of classes of semantic equivalence is mainly an intellectual task which cannot be fully automated. Therefore, as a starting point, each lexicon entry has its own MID. If the lexicon designer concludes that two lexicon entries have identical sense, then the two MIDs are fused. The incremental fusion of lexemes, however, leads repeatedly to a class of decisions which we can consider the main dilemma of the lexicon engineering process. Let K, L, and M be atomic lexical items. Two users group these items in different ways, according to slightly different subdomain contexts, here represented by D1 and D2, respectively. In D1 the lexical items K and L are considered synonyms. In D2, however, M and not L is considered a synonym of K. The fusion of these two subcontexts gives rise to two solutions, viz. closure and sum. Whereas the closure operation merges the synonym classes, the sum operation preserves the context-related distinction and introduces two senses for the ambiguous equivalence class. The decision whether to follow the one or the other strategy is complex. On the one hand, we end up with a tight network of ambiguous senses when pursuing the latter strategy. On the other hand, the transitive closure tends to yield numerous synonym classes in which pairs of lexemes are far from being synonymous. As an example, one user may assert synonymy between head and caput in an anatomy subdomain. Another one equates head with chief when modeling terms in a subdomain of administration. Applying the closure operation, chief would become a synonym of caput, and all literal and figurative senses of head would be represented by one MID. Applying the sum operation, head would be assigned an ambiguous MID which would then be related to its non-ambiguous senses.
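The two fusion strategies can be contrasted in a short Python sketch. It is a simplified illustration under our own assumptions: the function names and the generated MID names (#sense1, #sense2, #head) are hypothetical, and the sum operation creates one ambiguous MID per ambiguous lexeme rather than grouping lexemes that share the same set of senses.

```python
from collections import defaultdict


def closure(groupings):
    """Merge all synonym groups that share a lexeme (transitive closure)."""
    classes = []
    for group in groupings:
        merged = set(group)
        untouched = []
        for existing in classes:
            if existing & merged:
                merged |= existing
            else:
                untouched.append(existing)
        classes = untouched + [merged]
    return classes


def sum_operation(groupings):
    """Keep each grouping as a separate sense; a lexeme found in several
    groupings gets an ambiguous MID linked to those senses via has-sense."""
    senses = {f"#sense{i}": set(g) for i, g in enumerate(groupings, 1)}
    membership = defaultdict(list)
    for mid, group in senses.items():
        for lexeme in group:
            membership[lexeme].append(mid)
    # Non-ambiguous classes keep only lexemes that belong to one sense.
    classes = {mid: {lex for lex in group if len(membership[lex]) == 1}
               for mid, group in senses.items()}
    has_sense = {}
    for lexeme, mids in membership.items():
        if len(mids) > 1:
            classes[f"#{lexeme}"] = {lexeme}
            has_sense[f"#{lexeme}"] = set(mids)
    return classes, has_sense


GROUPINGS = [{"head", "caput"}, {"head", "chief"}]
merged_classes = closure(GROUPINGS)
sum_classes, sum_links = sum_operation(GROUPINGS)
```

Closure produces the single class {head, caput, chief}, so chief and caput become synonyms. The sum operation keeps caput and chief apart and links the ambiguous MID for head to both senses.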
The construction of multilingual dictionaries which account for the variety of meanings in different domain contexts and languages constitutes a major challenge, even if restricted to a technical sublanguage such as the medical one. We have presented an approach which concentrates on the economic encoding of subwords as lexical units. The main criterion for the inclusion of a subword entry in the lexicon is semantic atomicity, since semantically composed entries can be reconstructed from atomic ones. Besides the proper delimitation of lexical items, which should optimize both generality (to warrant a high recall) and specificity (to warrant a high precision), the grouping of lexical items into domain-specific equivalence classes has posed problems which have required the formulation of rigid editing guidelines for lexicon developers and are currently guiding the development of benchmarking and validation tools. Presently, the MORPHOSAURUS lexicon contains about 80,000 lexical items which are related to about 20,000 equivalence classes. Due to its compositional character it has a high coverage for English, Portuguese, Spanish, and German. French and Swedish lexicons are currently under construction.
Acknowledgements: This research was sponsored by CNPq, the Brazilian Research
SNOMED (2004). SNOMED Clinical Terms. Northfield, IL: College of American Pathologists.
Buvač, S., Buvač, V., and Mason, I. A. (1994). The semantics of propositional contexts. In Proceedings of the 8th International Symposium on Methodologies for Intelligent Systems. Berlin: Springer.
Eco, U., Robering, K., Scheffczyk, A., and Habermeier, R. (1988). Metamorphoses of the semiotic triangle. Zeitschrift für Semiotik, 10(3).
Fellbaum, C., editor (1998). WORDNET: An Electronic Lexical Database. Cambridge, MA: MIT Press.
Hahn, U., Markó, K., Poprat, M., Schulz, S., Wermter, J., and Nohama, P. (2004). Crossing languages in text retrieval via an interlingua. In RIAO 2004 – Conference Proceedings: Coupling Approaches, Coupling Media and Coupling Languages for Information Retrieval, pages 100–115. Avignon, France, 26-28 April 2004. Paris: Centre de Hautes Etudes Internationales d’Informatique Documentaire (CID).
Hirst, G. (2004). Ontologies and the lexicon. In Staab, S. and Studer, R., editors, Handbook on Ontologies, International Handbooks on Information Systems, pages 209–229. Berlin: Springer.
McCray, A. T., Browne, A. C., and Moore, D. L. (1988). The semantic structure of neo-classical compounds. In Greenes, R. A., editor, SCAMC’88 – Proceedings of the 12th Annual Symposium on Computer Applications in Medical Care, pages 165–168. Washington, D.C., November 1988. New York, N.Y.: IEEE Computer Society Press.
MESH (2004). Medical Subject Headings. Bethesda, MD: National Library of Medicine.
UMLS (2004). Unified Medical Language System. Bethesda, MD: National Library of Medicine.
Schulz, S. and Hahn, U. (2000). Morpheme-based, cross-lingual indexing for medical document retrieval. International Journal of Medical Informatics, 59(3):87–99.
Schulz, S., Markó, K., Sbrissia, E., Nohama, P., and Hahn, U. (2004). Cognate mapping: A heuristic strategy for the semi-supervised acquisition of a Spanish lexicon from a Portuguese seed lexicon. In COLING Geneva 2004 – Proceedings of the 20th International Conference on Computational Linguistics, volume 2, pages 813–819. Geneva, Switzerland, August 23-27, 2004. Association for Computational Linguistics.