nezda/yawni

Introduction / Why Yawni and Why WordNet?

Yawni is an API to Princeton University's WordNet®. WordNet is a graph of words and their lexical and semantic relationships, and it is a potentially invaluable resource for injecting knowledge into applications. WordNet is probably the single most used NLP resource; many companies have it as their cornerstone. It embodies one of the most fundamental of all NLP problems: "word-sense disambiguation". The Yawni code library can be used to add lexical and semantic knowledge, primarily derived from WordNet, to your applications.

Yawni is written in the Java programming language.

The Yawni website is https://www.yawni.org/

Yawni currently consists of 3 main modules:

  • api/ Yawni WordNet API: a pure Java standalone object-oriented interface to the WordNet database of lexical and semantic relationships.

  • data*/ Yawni WordNet Data: Jar file containing the Princeton WordNet 3.0 data files, and derivative files to support efficient, exhaustive access to this information.

  • browser/ Yawni WordNet Browser: A GUI browser of WordNet content using the Yawni API.

🚀 Quick Start

Basic steps 👣

  1. Install JDK 8 (or greater), Apache Maven 3.0.3 (or greater)
  2. Specify the following Apache Maven dependencies in your project
    <dependency>
      <groupId>org.yawni</groupId>
      <artifactId>yawni-wordnet-api</artifactId>
      <version>2.0.0-SNAPSHOT</version>
    </dependency>
    <dependency>
      <groupId>org.yawni</groupId>
      <artifactId>yawni-wordnet-data30</artifactId>
      <version>2.0.0-SNAPSHOT</version>
    </dependency>
  3. Start using the Yawni API! All required resources are loaded on demand from the classpath (i.e., jars) and made accessible via a singleton:
    WordNetInterface wn = WordNet.getInstance();

Numerous unit tests that serve as great executable examples are included in api/src/test/java/org/yawni/. For a more complex example application, check out the browser/ sub-module.

Yet Another WordNet Interface!?

WordNet consists of enough data to exceed the recommended capacity of Java Collections (e.g., java.util.SortedMap<String, X>), but not enough to justify a full relational database.

There are a lot of Java interfaces to WordNet already. Here are 8 of the Java APIs, along with their URL and software license.

Many of the pure Java ones (like Yawni) are actually derivatives of Oliver Steele's original JWordNet. In fact, Yawni is the “new” name of that original Java WordNet, JWordNet.

Why Yawni?

  • commercial-grade implementation
    • 🚀 very fast & small memory footprint 👣
    • pure Java ☕ so it’s compatible with any JVM language! Scala, Clojure, Kotlin, …
    • facilitates access to all aspects of WordNet data and algorithms, including the "Morphy" morphological processing (i.e., lemmatization / stemming) routines
    • simple, intuitive, and well documented 📚 API
    • all required resources load from jars by default making deployment a snap 💥
    • all query results are immutable 🔒; safe for use in caches and for access by concurrent threads
    • easy Apache Maven-based build with minimal dependencies
    • extensive unit tests 🧪 provide peace of mind (and great examples!)
  • includes refined GUI browser featuring
    • user-friendly 😊 🎛 🔍 & snappy 🚀
    • incremental find 🔍 (Ctrl+Shift+F / ⌘ ⇧ F)
    • no limits on search: Never see “Search too large. Narrow search and try again...” again!
    • comprehensive keyboard navigation ⌨ 🧭 support (arrows ⇦ ⇨ ⇧ ⇩, tab ↹, etc.)
    • multi-window 🪟🪟 support (Ctrl+N / ⌘ N)
    • cross-platform 🔀 including zero-install Java Web Start version
  • commercial-friendly Apache license
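
To give a feel for what the "Morphy" lemmatization mentioned above involves, here is a minimal, self-contained sketch. It is not Yawni's implementation: it applies only the noun suffix-detachment rules listed in Princeton's morphy documentation and checks candidates against a tiny hypothetical lexicon (`LEXICON`), where a real implementation consults the WordNet index and exception lists. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class MorphySketch {
    // Hypothetical stand-in for the WordNet noun index (assumption for this sketch).
    private static final Set<String> LEXICON = new HashSet<>(
        Arrays.asList("dog", "box", "church", "man", "fly", "glass", "bus"));

    // Noun detachment rules as {suffix, replacement} pairs.
    private static final String[][] NOUN_RULES = {
        {"s", ""}, {"ses", "s"}, {"xes", "x"}, {"zes", "z"},
        {"ches", "ch"}, {"shes", "sh"}, {"men", "man"}, {"ies", "y"},
    };

    /** Returns the base forms of a noun that are present in the lexicon, in discovery order. */
    public static List<String> baseForms(String noun) {
        String word = noun.toLowerCase(Locale.ROOT);
        Set<String> found = new LinkedHashSet<>();
        // The word itself may already be a base form.
        if (LEXICON.contains(word)) {
            found.add(word);
        }
        // Strip each matching suffix, append its replacement, and keep only known words.
        for (String[] rule : NOUN_RULES) {
            String suffix = rule[0];
            if (word.endsWith(suffix)) {
                String candidate = word.substring(0, word.length() - suffix.length()) + rule[1];
                if (LEXICON.contains(candidate)) {
                    found.add(candidate);
                }
            }
        }
        // Return an unmodifiable result, in the spirit of immutable query results.
        return Collections.unmodifiableList(new ArrayList<>(found));
    }

    public static void main(String[] args) {
        for (String w : new String[] {"dogs", "boxes", "churches", "men", "flies", "xyzzy"}) {
            System.out.println(w + " -> " + baseForms(w));
        }
    }
}
```

The lexicon check is what separates this from a blind suffix stripper: a candidate such as "boxe" (from "boxes" minus "s") is discarded because it is not a known word.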

Changes in 2.x versions

  • Extreme speed improvements: literally faster than the C version (benchmark source included)
    • Bloom filters used to avoid fruitless lookups (no loss in accuracy!)
    • re-implemented LRUCache using Google Guava's MapMaker
    • FileManager.CharStream and FileManager.NIOCharStream use in-memory data and java.nio for maximum speed
  • Major reduction in memory requirements
    • use of primitives where possible (hidden by API)
    • eliminated unused / unneeded fields
  • Implemented Morphy stemming / lemmatization algorithms
  • Completely rewritten GUI browser in Java Swing featuring
    • incremental find
    • no limits on search: Never see “Search too large. Narrow search and try again...” again!
  • Support for WordNet 3.0 data files (and all older formats)
  • Support for numerous optional and extended WordNet resources
    • 'sense tagged frequencies' (WordSense.getSensesTaggedFrequency())
    • 'lexicographer category' (Synset.getLexCategory())
    • 14 new 'morphosemantic' relations (RelationType.RelationTypeType.MORPHOSEMANTIC)
    • 'evocation' empirical ranks (WordSense.getCoreRank())
  • Supports reading ALL data files from JAR file
  • Many bug fixes
    • fixed broken RelationTypes
    • fixed Verb example sentences and generic frames (and made them directly accessible)
    • fixed iteration bugs and memory leaks
    • fixed various thread safety bugs
  • Updated to leverage Java 1.6 and beyond
    • generics
    • use of Enum, EnumSet, and EnumMap where apropos
    • uses maximally configurable slf4j logging system
    • added LookaheadIterator (analogous to old LookaheadEnumeration)
      • changed to even better Google Guava AbstractIterator
  • Growing suite of unit tests
  • Automated all build infrastructure using Apache Maven
  • New / changed API methods
    • renamed Word → WordSense, IndexWord → Word, Pointer → Relation, PointerType → RelationType, PointerTarget → RelationTarget
    • WordSense.getSenseNumber()
    • WordSense.getTaggedSenseCount()
    • WordSense.getAdjPosition()
    • WordSense.getVerbFrames()
    • Word.isCollocation()
    • Word.getRelationTypes()
    • Synset.getLexCategory()
    • RelationTarget.getSynset()
    • Word.getSenses() → Word.getSynsets()
    • Word.getWordSenses()
    • WordSense.getTargets() → WordSense.getRelationTargets()
    • DictionaryDatabase iteration methods are Iterables for ease of use (e.g., for loops)
    • all core classes implement Comparable<T>
    • all core classes implement Iterable<WordSense>
    • added iteration for all WordSenses and all Relations (and all of a certain RelationType)
    • added support for POS.ALL where apropos
    • all major classes are final
    • currently, no major classes are Serializable
    • removed RMI client / server capabilities - deemed overkill
    • removed applet - didn't justify its maintenance burden
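
The changelog above mentions Bloom filters that avoid fruitless lookups with no loss in accuracy. The following self-contained sketch shows why that works in general; it is not Yawni's code, and the class name, sizing parameters, and hash choices are assumptions made for illustration. A Bloom filter never reports a stored key as absent, so a "no" answer lets the caller skip the expensive lookup safely, while a "maybe" answer simply falls through to the real lookup.

```java
import java.nio.charset.StandardCharsets;
import java.util.BitSet;

public class BloomFilterSketch {
    private final BitSet bits;
    private final int numBits;
    private final int numHashes;

    public BloomFilterSketch(int numBits, int numHashes) {
        this.bits = new BitSet(numBits);
        this.numBits = numBits;
        this.numHashes = numHashes;
    }

    // FNV-1a over the UTF-8 bytes: a second hash, independent of String.hashCode().
    private static int fnv1a(String key) {
        int hash = 0x811C9DC5;
        for (byte b : key.getBytes(StandardCharsets.UTF_8)) {
            hash ^= (b & 0xff);
            hash *= 16777619;
        }
        return hash;
    }

    // Double hashing: derive the i-th bit index from two base hashes.
    private int index(String key, int i) {
        return Math.floorMod(key.hashCode() + i * fnv1a(key), numBits);
    }

    /** Records a key by setting all of its bit positions. */
    public void add(String key) {
        for (int i = 0; i < numHashes; i++) {
            bits.set(index(key, i));
        }
    }

    /** False means definitely absent; true means possibly present (false positives can occur). */
    public boolean mightContain(String key) {
        for (int i = 0; i < numHashes; i++) {
            if (!bits.get(index(key, i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        BloomFilterSketch filter = new BloomFilterSketch(1024, 3);
        for (String lemma : new String[] {"dog", "cat", "yawn"}) {
            filter.add(lemma);
        }
        // Added keys always answer true; unknown keys usually answer false.
        System.out.println("dog: " + filter.mightContain("dog"));
        System.out.println("qwertyuiop: " + filter.mightContain("qwertyuiop"));
    }
}
```

In a lookup path, the filter would be consulted first and the dictionary files searched only when `mightContain` returns true, so results are identical to an unfiltered lookup and only the wasted searches disappear.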