Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz G4.19 Research Group Wrocław University of Technology nlp.pwr.wroc.pl plwordnet.pwr.wroc.pl Beyond the Transfer and Merge Wordnet Construction: plWordNet and a Comparison with WordNet

Page 1:

Marek Maziarz, Maciej Piasecki, Ewa Rudnicka, Stanisław Szpakowicz

G4.19 Research Group, Wrocław University of Technology

nlp.pwr.wroc.pl

plwordnet.pwr.wroc.pl

Beyond the Transfer and Merge Wordnet Construction: plWordNet and a Comparison with WordNet

Page 2:

Wordnet

{samochód 1, pojazd samochodowy 1, auto 1, wóz 1 `car, automobile’}

{pogotowie 3, karetka 1, sanitarka 1, karetka pogotowia 1 `ambulance’}

{samochodzik 2 `small car’}

{bagażnik 1 `boot’}

Relations shown in the diagram: hypernymy/hyponymy, meronymy, diminutiveness
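The diagram can be read as a small typed graph. Below is a minimal Python sketch of that reading; the `Synset` class, its `link` method and the relation labels are illustrative names, not plWordNet's actual data model. Which synset each arrow connects is an assumption based on the glosses (ambulance as a kind of car, boot as a part of a car, small car as a diminutive of car).

```python
from dataclasses import dataclass, field


@dataclass
class Synset:
    """A group of lexical units (lemma, sense number) with typed links to other synsets."""
    units: list                      # e.g. [("samochód", 1), ("auto", 1)]
    gloss: str = ""
    relations: dict = field(default_factory=dict)  # relation name -> list of target synsets

    def link(self, relation, target):
        """Record that this synset stands in `relation` to `target`."""
        self.relations.setdefault(relation, []).append(target)


car = Synset([("samochód", 1), ("pojazd samochodowy", 1), ("auto", 1), ("wóz", 1)],
             "car, automobile")
ambulance = Synset([("pogotowie", 3), ("karetka", 1), ("sanitarka", 1),
                    ("karetka pogotowia", 1)], "ambulance")
small_car = Synset([("samochodzik", 2)], "small car")
boot = Synset([("bagażnik", 1)], "boot")

# Assumed direction of each arrow in the diagram (not stated in the transcript).
ambulance.link("hypernymy", car)        # an ambulance is a kind of car
boot.link("meronymy", car)              # a boot is a part of a car
small_car.link("diminutiveness", car)   # samochodzik is a diminutive of samochód

hypernym_glosses = [s.gloss for s in ambulance.relations["hypernymy"]]
```

Here `hypernym_glosses` holds the glosses of the synsets reached from the ambulance synset by one hypernymy step.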

Page 3:

plWordNet 2.0

Page 4:

Independent vs. Translation-based Wordnet Construction

• Transfer and merge. Examples:
– EuroWordNet – most component wordnets built by the transfer method (Vossen 2002)
– MultiWordNet – semi-automatic acquisition method from the Princeton WordNet (Bentivogli et al. 2000)
– IndoWordNet – expansion from Hindi Wordnet (Sinha et al. 2006, Bhattacharyya 2010)
– FinWordNet – directly translated from the Princeton WordNet

Page 5:

Independent vs. Translation-based Wordnet Construction

• From scratch. Examples:
– GermaNet – the core built independently
– plWordNet – a unique, corpus-based method; largely independent of the Princeton WordNet

Page 6:

Synonymy and synsets

• “A wordnet is a collection of synsets linked by semantic relations.”

• A synset is a set of synonyms which represent the same lexicalised concept

• Synonyms are members of the same synset

Wordnet development deserves better: an operational theory with precise guidelines for wordnet editors.

Page 7:

Basic building block: synset vs lexical unit?

• Synset relations link lexicalised concepts
• But are named after linguistic lexico-semantic relations
• Substitution tests are defined for lexical units
• Synsets group lexical units
• Every wordnet includes relations between lexical units (lexical relations), e.g., antonymy
• Lexical units can be observed in text, concepts cannot

Page 8:

Constitutive relations

• Synset = a group of lexical units which share all constitutive relations

• Constitutive relation = a lexico-semantic relation which
– is frequent enough
– and frequently shared by groups of lexical units
Also
– is established in linguistics
– and accepted in the wordnet tradition

• Examples: hypernymy, meronymy, cause
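The definition above can be illustrated with a short Python sketch that groups lexical units by the set of constitutive relation instances they take part in. This is only an illustration of the idea, not the editors' procedure (which is manual). The function name, the `CONSTITUTIVE` set and the toy data are assumptions; the targets `pojazd 1` and `nienawiść 1` are invented for the example.

```python
from collections import defaultdict

# Relation types treated as constitutive in this sketch (the examples named on the slide).
CONSTITUTIVE = {"hypernymy", "meronymy", "cause"}


def group_into_synsets(lu_relations):
    """Group lexical units that share exactly the same constitutive relation instances.

    lu_relations maps a lexical unit to a set of (relation, target) pairs.
    Non-constitutive relations (e.g. antonymy) are ignored when comparing units.
    """
    groups = defaultdict(list)
    for lu, rels in lu_relations.items():
        signature = frozenset((r, t) for r, t in rels if r in CONSTITUTIVE)
        groups[signature].append(lu)
    return [sorted(members) for members in groups.values()]


# Toy data, partly invented for illustration.
example = {
    "miłość 1": {("hypernymy", "uczucie 2"), ("antonymy", "nienawiść 1")},
    "umiłowanie 1": {("hypernymy", "uczucie 2")},
    "kochanie 1": {("hypernymy", "uczucie 2")},
    "samochód 1": {("hypernymy", "pojazd 1")},
    "auto 1": {("hypernymy", "pojazd 1")},
}

synsets = sorted(group_into_synsets(example))
```

Note that units with no constitutive relations at all would fall into one group here, so a real procedure would need additional evidence for them.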

Page 9:

Synset as an abbreviation

Synset as a notational convention for a group of lexical units sharing certain relations; represents synonyms

{afekt 1 `passion’, uczucie 2 `feeling’}
hypernym
{miłość 1 `love’, umiłowanie 1 `affection’, kochanie 1 `loving’}

This is based on constitutive relations
Additional distinctions: stylistic register and aspect
Minimal commitment principle: make as few assumptions as possible
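One way to read "synset as an abbreviation" is that a single synset-level link stands for every pairwise link between the lexical units involved. The Python sketch below spells this out for the example on this slide; the function name `expand` and the tuple layout are illustrative assumptions.

```python
from itertools import product


def expand(source_synset, relation, target_synset):
    """Expand one synset-level link into the lexical-unit links it abbreviates."""
    return [(s, relation, t) for s, t in product(source_synset, target_synset)]


love = ["miłość 1", "umiłowanie 1", "kochanie 1"]
feeling = ["afekt 1", "uczucie 2"]

# One synset-level hypernymy link stands for 3 x 2 = 6 lexical-unit links.
links = expand(love, "hypernym", feeling)
```

Each tuple in `links` reads "the first unit has the second unit as its hypernym".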

Page 10:

Relations in plWordNet

• Starting point: relations in Princeton WordNet, EuroWordNet and GermaNet, e.g., hyponymy, meronymy, antonymy, cause, instance for proper names
• Additional constitutive relations
– e.g., verb meronymy, preceding, presupposition
– gradation for adjectives

Page 11:

Relations in plWordNet

• Specific: derivationally based lexico-semantic relations, e.g.,
– inhabitant (góral `highlander’ – góry `highlands’)
– inchoativity (zapalić się [perfective] `light, start burning’ – palić się [imperfective] `burn, produce light’)
– process (chamieć [imperfective] `to become a boor’ – cham `boor’)

Page 12:

Construction process

1. Data collection: 1.8-billion-word corpus
2. Data selection phase
– corpus browsing
– WSD-based word usage example extraction
– WordnetWeaver: semi-automatic expansion
3. Data analysis – questions
• is it a correct Polish lemma?
• how many lexical units does it have?
• how to describe them with relations?

• Other knowledge sources: available Polish dictionaries, thesauri, encyclopaedias, lexicons, the Web, and intuition.

Page 13:

The result – size matters

compared with Princeton WordNet:

• General statistics
• Lexical coverage
• Polysemy
• Synset size
• Relation density
• Hypernymy depth

www.plwordnet.pwr.wroc.pl

Page 14:

General statistics

Number of synsets, lemmas and LUs in the largest wordnets

Page 15:

Lexical coverage

Proportion of lemmas from PWN/plWN found among vocabulary with a given corpus frequency
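The measure named in this caption can be sketched as follows. The slide does not say how "a given corpus frequency" is binned, so this Python sketch assumes a threshold reading (all corpus lemmas with frequency at or above a threshold); the function name and the toy frequencies are invented for illustration.

```python
def coverage_by_frequency(corpus_freq, wordnet_lemmas, thresholds):
    """For each threshold t, return the share of corpus lemmas with
    frequency >= t that are also present in the wordnet."""
    result = {}
    for t in thresholds:
        vocab = [lemma for lemma, f in corpus_freq.items() if f >= t]
        if vocab:  # skip thresholds that leave no vocabulary
            covered = sum(1 for lemma in vocab if lemma in wordnet_lemmas)
            result[t] = covered / len(vocab)
    return result


# Toy data: frequencies are made up, "xyzzy" stands for a non-word in the corpus.
corpus_freq = {"dom": 1000, "kot": 500, "auto": 50, "xyzzy": 5}
wordnet_lemmas = {"dom", "kot", "auto"}

coverage = coverage_by_frequency(corpus_freq, wordnet_lemmas, [1, 100])
```

With this toy data, three of the four lemmas seen at least once are covered, and both lemmas seen at least 100 times are covered.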

Page 16:

Polysemy

Proportion of polysemous lemmas with regard to POS
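A minimal Python sketch of this statistic, assuming a lemma counts as polysemous when it has more than one lexical unit within the same part of speech. The function name and the input layout are assumptions, and the sense inventory below is toy data, not plWordNet's actual counts.

```python
from collections import Counter


def polysemy_by_pos(lexical_units):
    """Share of lemmas with more than one lexical unit, per part of speech.

    lexical_units is an iterable of (lemma, pos, sense_number) triples.
    """
    senses = Counter((lemma, pos) for lemma, pos, _ in lexical_units)
    total, poly = Counter(), Counter()
    for (_lemma, pos), n in senses.items():
        total[pos] += 1
        if n > 1:
            poly[pos] += 1
    return {pos: poly[pos] / total[pos] for pos in total}


# Toy sense inventory for illustration only.
units = [
    ("pogotowie", "noun", 1), ("pogotowie", "noun", 2), ("pogotowie", "noun", 3),
    ("karetka", "noun", 1),
    ("samochodzik", "noun", 1), ("samochodzik", "noun", 2),
    ("chamieć", "verb", 1),
]

proportions = polysemy_by_pos(units)
```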

Page 17:

Relation density

Synset relation density in PWN 3.1 and in plWordNet 2.0

Page 18:

Hypernymy depth

Hypernymy path length for nouns in PWN 3.1 and plWordNet 2.0
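The slide does not define path length precisely. The Python sketch below assumes it is the number of hypernymy edges on the longest path from a node to a root, and that the hierarchy is acyclic; `make_depth` is an illustrative name. The example data is the class chain listed on the SUMO comparison slide of this deck.

```python
from functools import lru_cache


def make_depth(hypernyms):
    """Return a function giving the longest hypernymy path (in edges) to a root.

    hypernyms maps a node to a tuple of its direct hypernyms.
    Assumes the hierarchy is acyclic.
    """
    @lru_cache(maxsize=None)
    def depth(node):
        parents = hypernyms.get(node, ())
        if not parents:
            return 0  # a root has no hypernyms
        return 1 + max(depth(p) for p in parents)

    return depth


chain = {
    "Computer": ("ElectricDevice",),
    "ElectricDevice": ("Device",),
    "Device": ("Artifact",),
    "Artifact": ("Object",),
    "Object": ("Physical",),
    "Physical": ("Entity",),
}

depth = make_depth(chain)
computer_depth = depth("Computer")  # 7 nodes on the chain, so 6 edges
```

Taking the maximum over parents is one choice among several; a shortest-path or node-counting variant would give different numbers for the same hierarchy.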

Page 19:

Hypernymy depth

Polish WordNet

Princeton WordNet

Page 20:

Hypernymy depth

Panels: Polish WordNet, Princeton WordNet, SUMO

Class chain shown in the figure, from most specific to most general: Computer, ElectricDevice, Device, Artifact, Object, Physical, Entity

Page 21:

Mapping procedure: plWordNet onto Princeton WordNet

1. Recognise the sense of the source synset:
• the position in the network structure
• existing relations, commentaries; other synsets containing the given lemma
2. Search for the target synset
• candidates for the target synset: intuitions, automatic prompting and dictionaries
• verifying candidates:
– comparing hypernymy and hyponymy structures
– existing inter-lingual relations
– definitions, commentaries; dictionaries
3. Link the source synset with the target synset

Page 22:

Hierarchy of inter-lingual relations

• Inter-lingual synonymy (only one per synset)
• Inter-lingual inter-register synonymy
• I-partial synonymy
• I-hyponymy
• I-hypernymy
• I-meronymy for parts, elements or materials of bigger wholes
• I-holonymy for a whole made of smaller parts, elements or materials
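A hedged Python sketch of how such a hierarchy could be applied, assuming it is an order of preference (the first relation in the list that an editor judges applicable is the one used) and that the "only one per synset" constraint is checked separately. The relation labels, function names and synset identifiers are illustrative, not plWordNet's internal names.

```python
from collections import Counter

# Order taken from the slide; the labels themselves are shorthand for this sketch.
HIERARCHY = [
    "I-synonymy",
    "I-inter-register synonymy",
    "I-partial synonymy",
    "I-hyponymy",
    "I-hypernymy",
    "I-meronymy",
    "I-holonymy",
]


def pick_relation(applicable):
    """Return the highest-ranked relation among those judged applicable, or None."""
    for rel in HIERARCHY:
        if rel in applicable:
            return rel
    return None


def multiple_synonymy(links):
    """Return source synsets that violate 'only one I-synonymy per synset'.

    links is a list of (source_synset, relation, target_synset) triples.
    """
    counts = Counter(src for src, rel, _ in links if rel == "I-synonymy")
    return sorted(src for src, n in counts.items() if n > 1)


chosen = pick_relation({"I-meronymy", "I-hyponymy"})

# Hypothetical identifiers, for illustration only.
links = [
    ("pl:1", "I-synonymy", "en:10"),
    ("pl:1", "I-synonymy", "en:11"),
    ("pl:2", "I-hyponymy", "en:10"),
]
violations = multiple_synonymy(links)
```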

Page 23:

Results of inter-lingual mapping

• Mapping direction: plWordNet – Princeton WordNet
• Bottom-up – from the lowest levels in the hierarchy up
• ~48 300 synsets mapped (~64 400 lexical units/senses)
– Synonymy: 15268
– Partial synonymy: 971
– Inter-register synonymy: 676
– Hyponymy: 23677
– Hypernymy: 3526
– Meronymy: 1898
– Holonymy: 555
• Mapped branches
– people, artefacts, places, food, time units: all
– communication, states and processes, body parts, group names: partially
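The per-relation counts on this slide can be summed and turned into shares with a few lines of Python. The counts are copied from the slide; the variable names are illustrative. The sum of the listed counts is 46,571, which is lower than the approximately 48,300 mapped synsets quoted above; the slide does not explain the difference.

```python
# Counts of inter-lingual relation instances, as listed on the slide.
counts = {
    "Synonymy": 15268,
    "Partial synonymy": 971,
    "Inter-register synonymy": 676,
    "Hyponymy": 23677,
    "Hypernymy": 3526,
    "Meronymy": 1898,
    "Holonymy": 555,
}

total = sum(counts.values())

# Share of each relation type, in percent, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}
```

On these figures, inter-lingual hyponymy accounts for about half of the listed links and inter-lingual synonymy for about a third.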

Page 24:

Different relations for coding the same conceptual dependencies

Page 25:

Applications

A free WordNet-type licence facilitates applications. Examples:
• Semantic annotation in a corpus of referential gestures (Lis, 2012)
• Lexicon of semantic valency frames (Hajnicz, 2011; Hajnicz, 2012)
• Features for text mining from Web pages (Maciołek and Dobrowolski, 2013)
• Mapping between a lexicon and an ontology (Wróblewska et al., 2013)
• Word-to-word similarity in ontologies (Lula and Paliwoda-Pękosz, 2009)
• Text similarity for Information Retrieval (Siemiński, 2012)
• Text classification (Maciołek, 2010)
• Terminology extraction and clustering (Mykowiecka and Marciniak, 2012)
• Automated extraction of Opinion Attribute Lexicons (Wawer and Gołuchowski, 2012)
• Named Entity Recognition
• Word Sense Disambiguation (Gołuchowski and Przepiórkowski, 2012)
• Anaphora resolution

More than 500 registered users, ~70 declared commercial applications

Page 26:

Conclusions

• plWordNet 2.0 – a national wordnet not adapted from Princeton WordNet
• plWordNet 2.0 is comparable to WordNet 3.1 in size, as well as in lexical coverage, hypernymy depth and relation density
• Synset membership depends only on constitutive relations between lexical units.
• A unique mapping strategy and a unique opportunity to compare the two lexical systems
• plWordNet 3.0 (2015):
– a comprehensive wordnet of Polish
– 200k lemmas and 260k LUs, mapped to PWN 3.?

Page 27:

www.plwordnet.pwr.wroc.pl

Thank you!

Page 28:

Differences between plWN and PWN

• Inter-lingual lexico-grammatical differences:
– marked forms (diminutives, augmentatives)
– lexicalised gender
– lexical gaps
• Differences in the definition of synonymy and synset:
– 'Mixed' PWN synsets – marked and unmarked forms, feminine and masculine, countable and uncountable, hypernym and hyponym
– hypernymy: and (plWN) vs. and/or (PWN)

Page 29:

Differences between plWN and PWN

• Other differences:
– synset definitions incompatible with relations (PWN)
– different relations used for coding the same conceptual dependencies
– more fine-grained meaning differentiation
– differences boiling down to the content and size of resource

Page 30:

Differences in lexicalisation

Page 31:

Relation density

Synset relation density in PWN 3.1 and in plWordNet 2.0 in selected semantic domains (chart axis: semantic domain)