Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz
G4.19 Research Group
Wrocław University of Technology
nlp.pwr.wroc.pl
plwordnet.pwr.wroc.pl
Beyond the Transfer and Merge Wordnet Construction: plWordNet and a Comparison with WordNet
Wordnet
Example from plWordNet 2.0:
• {samochód 1, pojazd samochodowy 1, auto 1, wóz 1 `car, automobile'}
– hypernymy/hyponymy: {pogotowie 3, karetka 1, sanitarka 1, karetka pogotowia 1 `ambulance'}
– meronymy: {bagażnik 1 `boot'}
– diminutiveness: {samochodzik 2 `small car'}
Independent vs. Translation-based Wordnet Construction
• Transfer and merge. Examples:
– EuroWordNet – most component wordnets built by the transfer method (Vossen 2002)
– MultiWordNet – semi-automatic acquisition method from the Princeton WordNet (Bentivogli et al. 2000)
– IndoWordNet – expansion from Hindi Wordnet (Sinha et al. 2006, Bhattacharyya 2010)
– FinWordNet – directly translated from the Princeton WordNet
Independent vs. Translation-based Wordnet Construction
• From scratch. Examples:
– GermaNet – the core built independently
– plWordNet – a unique, corpus-based method; largely independent of the Princeton WordNet
Synonymy and synsets
• “A wordnet is a collection of synsets linked by semantic relations.”
• A synset is a set of synonyms which represent the same lexicalised concept
• Synonyms are members of the same synset
Wordnet development deserves better: an operational theory with precise guidelines for wordnet editors.
Basic building block: synset vs lexical unit?
• Synset relations link lexicalised concepts
• But are named after linguistic lexico-semantic relations
• Substitution tests are defined for lexical units
• Synsets group lexical units
• Every wordnet includes relations between lexical units (lexical relations), e.g., antonymy
• Lexical units can be observed in text, concepts cannot
Constitutive relations
• Synset = a group of lexical units which share all constitutive relations
• Constitutive relation = a lexico-semantic relation which
– is frequent enough
– and frequently shared by groups
Also
– is established in linguistics
– and accepted in the wordnet tradition
• Examples: hypernymy, meronymy, cause
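The definition above (a synset is a group of lexical units sharing all constitutive relations) can be illustrated with a minimal sketch. The names `relations` and `group_into_synsets` are hypothetical, and the relation instances are a toy reading of the examples on these slides, not data taken from plWordNet. The sketch also ignores the additional distinctions of stylistic register and aspect.

```python
from collections import defaultdict

# Toy data (assumption): each lexical unit maps to the set of constitutive
# relation instances it takes part in, written as (relation, target) pairs.
relations = {
    "miłość 1": {("hypernym", "uczucie 2")},
    "umiłowanie 1": {("hypernym", "uczucie 2")},
    "kochanie 1": {("hypernym", "uczucie 2")},
    "samochód 1": {("meronym", "bagażnik 1")},
    "auto 1": {("meronym", "bagażnik 1")},
    "pogotowie 3": {("hypernym", "samochód 1")},
    "karetka 1": {("hypernym", "samochód 1")},
}


def group_into_synsets(relations):
    """Group lexical units whose sets of constitutive relations are identical."""
    groups = defaultdict(list)
    for unit, rels in relations.items():
        # frozenset makes the relation set usable as a dictionary key
        groups[frozenset(rels)].append(unit)
    return [sorted(units) for units in groups.values()]


synsets = group_into_synsets(relations)
for synset in synsets:
    print("{" + ", ".join(synset) + "}")
```

Under this reading a synset is derived from the relations rather than stipulated first, which is why it can be treated as an abbreviation for a group of lexical units.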
Synset as an abbreviation
Synset as a notational convention for a group of lexical units sharing certain relations; it represents synonyms
{afekt 1 `passion', uczucie 2 `feeling'} – hypernym of {miłość 1 `love', umiłowanie 1 `affection', kochanie 1 `loving'}
This is based on constitutive relations
Additional distinctions: stylistic register and aspect
Minimal commitment principle: make as few assumptions as possible
Relations in plWordNet
• Starting point: relations in Princeton WordNet, EuroWordNet and GermaNet, e.g., hyponymy, meronymy, antonymy, cause, instance for proper names
• Additional constitutive relations
– e.g., verb meronymy, preceding, presupposition
– gradation for adjectives
Relations in plWordNet
• Specific: derivationally based lexico-semantic relations, e.g.,
– inhabitant (góral `highlander' – góry `highlands')
– inchoativity (zapalić się [perfective] `light, start burning' – palić się [imperfective] `burn, produce light')
– process (chamieć [imperfective] `to become a boor' – cham `boor')
Construction process
1. Data collection: 1.8-billion-word corpus
2. Data selection phase
– corpus browsing
– WSD-based word usage example extraction
– WordnetWeaver: semi-automatic expansion
3. Data analysis – questions
• is it a correct Polish lemma?
• how many lexical units does it have?
• how to describe them with relations?
• Other knowledge sources: available Polish dictionaries, thesauri, encyclopaedias, lexicons, the Web, and intuition.
The result – size matters
compared with Princeton WordNet:
• General statistics
• Lexical coverage
• Polysemy
• Synset size
• Relation density
• Hypernymy depth
www.plwordnet.pwr.wroc.pl
General statistics
Number of synsets, lemmas and LUs in the largest wordnets
Lexical coverage
Proportion of lemmas from PWN/plWN found among vocabulary with a given corpus frequency
Polysemy
Proportion of polysemous lemmas with regard to POS
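As a rough illustration of the statistic named on this slide, the sketch below counts a lemma as polysemous when it has more than one lexical unit within a part of speech. This reading is an assumption, and the sense counts in `lexical_units` are invented toy data, not figures from plWordNet or PWN. The function name `polysemy_by_pos` is hypothetical.

```python
from collections import Counter

# Invented toy lexicon: (lemma, part of speech, sense number) triples.
lexical_units = [
    ("pogotowie", "noun", 1), ("pogotowie", "noun", 2), ("pogotowie", "noun", 3),
    ("samochód", "noun", 1),
    ("karetka", "noun", 1),
    ("zapalić się", "verb", 1), ("zapalić się", "verb", 2),
    ("palić się", "verb", 1),
    ("chamieć", "verb", 1),
]


def polysemy_by_pos(units):
    """Return, per part of speech, the share of lemmas with more than one sense."""
    senses = Counter((lemma, pos) for lemma, pos, _ in units)
    total = Counter()
    polysemous = Counter()
    for (_, pos), n_senses in senses.items():
        total[pos] += 1
        if n_senses > 1:
            polysemous[pos] += 1
    return {pos: polysemous[pos] / total[pos] for pos in total}


proportions = polysemy_by_pos(lexical_units)
for pos, share in proportions.items():
    print(f"{pos}: {share:.2f}")
```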
Relation density
Synset relation density in PWN 3.1 and in plWordNet 2.0
Hypernymy depth
Hypernymy path length for nouns in PWN 3.1 and plWordNet 2.0
Hypernymy depth
[Figure: hypernymy depth; panels: Polish WordNet, Princeton WordNet]
Hypernymy depth
[Figure: Computer, ElectricDevice, Device, Artifact, Object, Physical, Entity; panels: Polish WordNet, Princeton WordNet, SUMO]
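A hypernymy path length of the kind compared on these slides could be computed roughly as follows. This is a minimal sketch with assumptions: the slides do not say whether edges or nodes are counted, or whether the shortest or longest path is used when a synset has several hypernyms; here edges on the shortest path to a root are counted. The chain in `hypernyms` is the SUMO sequence listed on the slide, read as child-to-parent links, and the function name `path_length_to_root` is hypothetical.

```python
from collections import deque

# Child -> list of direct hypernyms (assumed reading of the slide's chain).
hypernyms = {
    "Computer": ["ElectricDevice"],
    "ElectricDevice": ["Device"],
    "Device": ["Artifact"],
    "Artifact": ["Object"],
    "Object": ["Physical"],
    "Physical": ["Entity"],
    "Entity": [],
}


def path_length_to_root(node, hypernyms):
    """Number of hypernymy edges on the shortest path from node to a root."""
    queue = deque([(node, 0)])
    seen = {node}
    while queue:
        current, distance = queue.popleft()
        parents = hypernyms.get(current, [])
        if not parents:
            return distance  # a node without hypernyms is treated as a root
        for parent in parents:
            if parent not in seen:
                seen.add(parent)
                queue.append((parent, distance + 1))
    return None  # no root reachable (only possible if the graph is cyclic)


depth = path_length_to_root("Computer", hypernyms)
print(depth)
```

Breadth-first search is used so that the first root reached is the nearest one, even when a node has several hypernyms.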
Mapping procedure: plWordNet onto Princeton WordNet
1. Recognise the sense of the source synset:
• the position in the network structure
• existing relations, commentaries; other synsets containing the given lemma
2. Search for the target synset
• candidates for the target synset: intuitions, automatic prompting and dictionaries
• verifying candidates:
– comparing hypernymy and hyponymy structures
– existing inter-lingual relations
– definitions, commentaries; dictionaries
3. Link the source synset with the target synset
Hierarchy of inter-lingual relations
• Inter-lingual synonymy (only one per synset)
• Inter-lingual inter-register synonymy
• I-partial synonymy
• I-hyponymy
• I-hypernymy
• I-meronymy for parts, elements or materials of bigger wholes
• I-holonymy for a whole made of smaller parts, elements or materials
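One way to read the hierarchy above is as a preference order: when several inter-lingual relations could link a source synset to target candidates, the one listed first is chosen. That reading is an assumption, not something the slide states. The sketch below encodes it; the relation labels in `I_RELATIONS` are shortened forms of the slide's names, and `choose_relation` is a hypothetical helper.

```python
# Order follows the slide, from the first-listed relation to the last.
I_RELATIONS = [
    "I-synonymy",
    "I-inter-register synonymy",
    "I-partial synonymy",
    "I-hyponymy",
    "I-hypernymy",
    "I-meronymy",
    "I-holonymy",
]


def choose_relation(applicable):
    """Return the first relation in the hierarchy that the editor found applicable."""
    for relation in I_RELATIONS:
        if relation in applicable:
            return relation
    return None  # no inter-lingual relation applies; the synset stays unmapped


# Example: no synonym exists in the target wordnet, but both a more general
# and a more specific target synset were found.
print(choose_relation({"I-hypernymy", "I-hyponymy"}))
```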
Results of inter-lingual mapping
• Mapping direction: plWordNet – Princeton WordNet
• Bottom-up – from the lowest levels in the hierarchy up
• ~48 300 synsets mapped (~64 400 lexical units/senses)
– Synonymy: 15268
– Partial synonymy: 971
– Inter-register synonymy: 676
– Hyponymy: 23677
– Hypernymy: 3526
– Meronymy: 1898
– Holonymy: 555
• Mapped branches
– people, artefacts, places, food, time units: all
– communication, states and processes, body parts, group names: partially
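The relation counts above can be turned into shares with a few lines of arithmetic. The sketch uses only the numbers on the slide; the variable names are hypothetical. Note that the seven counts sum to 46 571, which is not the same as the ~48 300 mapped synsets reported; the slide does not explain the difference, so the shares below are relative to the sum of the listed counts.

```python
# Counts of inter-lingual relation instances as listed on the slide.
counts = {
    "synonymy": 15268,
    "partial synonymy": 971,
    "inter-register synonymy": 676,
    "hyponymy": 23677,
    "hypernymy": 3526,
    "meronymy": 1898,
    "holonymy": 555,
}

total = sum(counts.values())

# Percentage share of each relation type, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name}: {share}%")
```

On these figures, hyponymy accounts for about half of the links and synonymy for about a third.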
Different relations for coding the same conceptual dependencies
Applications
A free WordNet-type licence facilitates applications. Examples:
• Semantic annotation in a corpus of referential gestures (Lis, 2012)
• Lexicon of semantic valency frames (Hajnicz, 2011; Hajnicz, 2012)
• Features for text mining from Web pages (Maciolek and Dobrowolski, 2013)
• Mapping between a lexicon and an ontology (Wróblewska et al., 2013)
• Word-to-word similarity in ontologies (Lula and Paliwoda-Pękosz, 2009)
• Text similarity for Information Retrieval (Siemiński, 2012)
• Text classification (Maciołek, 2010)
• Terminology extraction and clustering (Mykowiecka and Marciniak, 2012)
• Automated extraction of Opinion Attribute Lexicons (Wawer and Gołuchowski, 2012)
• Named Entity Recognition
• Word Sense Disambiguation (Gołuchowski and Przepiórkowski, 2012)
• Anaphora resolution
More than 500 registered users, ~70 declared commercial applications
Conclusions
• plWordNet 2.0 – a national wordnet not adapted from Princeton WordNet
• plWordNet 2.0 is comparable to WordNet 3.1 in size, as well as in lexical coverage, hypernymy depth and relation density
• Synset membership depends only on constitutive relations between lexical units.
• A unique mapping strategy and a unique opportunity to compare the two lexical systems
• plWordNet 3.0 (2015):
– a comprehensive wordnet of Polish
– 200k lemmas and 260k LUs, mapped to PWN 3.?
Thank you!
www.plwordnet.pwr.wroc.pl
Differences between plWN and PWN
• Inter-lingual lexico-grammatical differences:
– marked forms (diminutives, augmentatives)
– lexicalised gender
– lexical gaps
• Differences in the definition of synonymy and synset:
– 'Mixed' PWN synsets – marked and unmarked forms, feminine and masculine, countable and uncountable, hypernym and hyponym
– hypernymy: and (plWN) vs. and/or (PWN)
Differences between plWN and PWN
• Other differences:
– synset definitions incompatible with relations (PWN)
– different relations used for coding the same conceptual dependencies
– more fine-grained meaning differentiation
– differences boiling down to the content and size of resource
Differences in lexicalisation
Relation density
Synset relation density in PWN 3.1 and in plWordNet 2.0 in selected semantic domains
[Figure axis: semantic domain]