Apache UIMA Example Wrappers for the OpenNLP Tools
Copyright 2006 The Apache Software Foundation.
Introduction
OpenNLP Tools is an open
source package of natural language processing components written in pure
Java. The tools are based on Adwait Ratnaparkhi's Ph.D. dissertation
(UPenn,
1998), which shows how to apply Maximum Entropy models to various
language ambiguity problems. The OpenNLP Tools rely on the OpenNLP MAXENT package, a mature
Java package for training and using maximum entropy models.
The OpenNLP Tools package (as of Version 1.3) includes a sentence
detector, tokenizer, part-of-speech tagger, noun phrase chunker, shallow
parser, named entity detector, and co-reference resolver. All
together these tools provide a rich and powerful set of text analysis
capabilities.
The Apache UIMA Example Wrappers for OpenNLP provide UIMA annotators
for most of the OpenNLP Tools components, allowing you to run the OpenNLP
Tools as UIMA annotators. The wrapper annotators were written to be
very simple examples of how pre-existing analysis components can be
deployed using the UIMA framework. The wrappers provide a thin layer
over the OpenNLP classes and use the "outermost" APIs to those
classes. As such, most of the work performed by the wrappers
involves translating the contents of the CAS (i.e., the document and any
annotations) into the input format required by the OpenNLP API, then
translating the result returned by the OpenNLP API into new annotations in
the CAS.
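The translate-in, run, translate-out pattern described above can be sketched in plain Java. This is only an illustration under stated assumptions: SketchSpan is a hypothetical stand-in for a UIMA annotation, and toolTokenize is a hypothetical stand-in for a third-party tool that returns strings rather than offsets. Neither is part of the UIMA or OpenNLP APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** A typed character span (hypothetical stand-in for an annotation, not the UIMA API). */
final class SketchSpan {
    final String type;
    final int begin;
    final int end;

    SketchSpan(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }
}

final class WrapperPatternSketch {

    /** Hypothetical tool that knows nothing about annotations: text in, token strings out. */
    static String[] toolTokenize(String sentence) {
        String trimmed = sentence.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    /** For each sentence span: copy the text out, call the tool, map results back to offsets. */
    static List<SketchSpan> tokenize(String document, List<SketchSpan> sentences) {
        List<SketchSpan> tokens = new ArrayList<>();
        for (SketchSpan sentence : sentences) {
            // Step 1: translate the stored content into the tool's input format (a String).
            String text = document.substring(sentence.begin, sentence.end);
            // Step 2: run the tool.
            String[] results = toolTokenize(text);
            // Step 3: translate the results back into spans with document-level offsets.
            int cursor = 0;
            for (String token : results) {
                int relativeBegin = text.indexOf(token, cursor);
                cursor = relativeBegin + token.length();
                tokens.add(new SketchSpan("Token",
                        sentence.begin + relativeBegin, sentence.begin + cursor));
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String doc = "UIMA wraps tools. OpenNLP finds tokens.";
        List<SketchSpan> sentences = new ArrayList<>();
        sentences.add(new SketchSpan("Sentence", 0, 17));
        sentences.add(new SketchSpan("Sentence", 18, 39));
        for (SketchSpan t : tokenize(doc, sentences)) {
            System.out.println(t.type + " [" + t.begin + "," + t.end + ") "
                    + doc.substring(t.begin, t.end));
        }
    }
}
```

Most of the code is bookkeeping around a single call to the tool, which is why the wrappers are described as a thin layer.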
The wrappers are not meant to represent an optimal integration of the
OpenNLP Tools into the UIMA framework. In fact, it is quite likely
that a more efficient integration could be achieved, e.g., by moving some
of the OpenNLP data structures into the CAS and avoiding much of the
copying and translating performed by the current implementation.
This version of the example wrappers requires version 1.3.0 of the
OpenNLP Tools and only supports the English version of the tools (and,
correspondingly, the English version of the models).
The rest of this Readme will show you how to compile and use the
OpenNLP Wrappers.
Prerequisites
To get started, you need to download OpenNLP Tools V1.3.0 from
SourceForge.net, compile the OpenNLP Tools package, create or download
from SourceForge.net the model files for the components you wish to run,
and finally compile the UIMA Wrappers for OpenNLP.
- Download OpenNLP Tools. Go to the OpenNLP homepage (opennlp.sourceforge.net) and
follow the link there to download the latest release of the OpenNLP
Tools package, opennlp-tools-1.3.0.tgz. Note that the
"Download" link at the bottom of that page might not point to the latest
release, so be sure you get version 1.3.0 or later. This package
contains the source code for the OpenNLP Tools, a few jar files required
by OpenNLP, the OpenNLP documentation, and an Ant build script (among
other things).
- Compile OpenNLP Tools. Follow the instructions in the README
file distributed in the OpenNLP Tools package to compile the OpenNLP
Tools and build the OpenNLP Tools jar file,
opennlp-tools-1.3.0.jar. The easiest way to do this is to
run Ant.
- Download the Model files. Go to the OpenNLP homepage (opennlp.sourceforge.net) and
follow the "Models" link at the bottom of the page to download English
model files for the OpenNLP Tools components that you plan to run.
You'll find more details about the model files in the README file for
the OpenNLP Tools package.
- Compile the UIMA Wrappers for OpenNLP. The UIMA Wrappers
package for OpenNLP is in the opennlp_wrappers sub-directory of the
uima_examples project distributed with the UIMA SDK (the directory where
you found this Readme file). To compile the wrappers, first import
the UIMA SDK uima_examples project into Eclipse using the instructions
in Section 3.2 of the UIMA SDK User's Guide and Reference
(assuming you haven't already done this). Next, add the
wrappers source directory to the build path of the uima_examples
project:
- Open the Properties dialog for the uima_examples project. You can
either "right click" on the example project and select "Properties"
from the menu, or select (highlight) the examples project then click
"Project->Properties" from the main menu.
- Click on "Java Build Path" to open the build path panel.
- Click on the "Source" tab to see the source folders on the build
path.
- Click "Add Folder..." and add "opennlp_wrappers/src" to the source
folders build path.
After adding the wrappers source directory, expect compilation errors,
because the OpenNLP jar files are not yet on the build path. Add them to the
build path for the uima_examples project. Open the "Java Build Path"
panel for the uima_examples project again (as above), click on the
"Libraries" tab, and add the following OpenNLP jar files to the build
path:
- maxent-2.4.0.jar
- trove.jar
- opennlp-tools-1.3.0.jar
maxent-2.4.0.jar and trove.jar can
be found in the "lib" folder of the OpenNLP Tools package, and
opennlp-tools-1.3.0.jar is the jar file you built in the "Compile OpenNLP Tools" step above.
The exact location of these jar files will depend on where you
downloaded and compiled the OpenNLP Tools.
At this point, your wrappers should compile and you are now ready to
run the OpenNLP Tools as UIMA Annotators.
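If you later run the wrappers outside Eclipse, a quick check like the following sketch can confirm that the OpenNLP classes are visible on the classpath. The class names are the ones mentioned in the annotator descriptions in this Readme. The helper class ClasspathCheckSketch is hypothetical and not part of the wrappers.

```java
import java.util.ArrayList;
import java.util.List;

final class ClasspathCheckSketch {

    /** Returns the subset of the given class names that cannot be loaded. */
    static List<String> missingClasses(String... classNames) {
        List<String> missing = new ArrayList<>();
        for (String name : classNames) {
            try {
                // initialize=false: only locate the class, do not run static initializers.
                Class.forName(name, false, ClasspathCheckSketch.class.getClassLoader());
            } catch (ClassNotFoundException | LinkageError e) {
                missing.add(name);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingClasses(
                "opennlp.tools.lang.english.SentenceDetector",
                "opennlp.tools.lang.english.Tokenizer",
                "opennlp.tools.lang.english.PosTagger",
                "opennlp.tools.lang.english.NameFinder",
                "opennlp.tools.lang.english.TreebankParser");
        if (missing.isEmpty()) {
            System.out.println("All OpenNLP classes were found on the classpath.");
        } else {
            System.out.println("Not found on the classpath: " + missing);
        }
    }
}
```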
Quick Test
For a quick test, open the descriptor file for the sentence detector
wrapper
opennlp_wrappers/descriptors/OpenNLPSentenceDetector.xml
using the Component Descriptor Editor plugin for Eclipse (see Chapter
8 of the UIMA SDK User's Guide and Reference). Click on the
"Parameter Settings" tab and set the value of the "ModelFile" parameter
to point to the English sentence detector model you downloaded in the
"Download the Model files" step above, e.g.:
C:\opennlp-models-1.3.0\english\sentdetect\EnglishSD.bin.gz
Save the descriptor. Start the UIMA Document Analyzer from
Eclipse as described in Chapter 12 of the UIMA SDK User's Guide and
Reference. Set the Input and Output directories
as shown in Section 12.2. For the Location of TAE XML
Descriptor, specify:
opennlp_wrappers/descriptors/OpenNLPSentenceDetector.xml
Note that the opennlp_wrappers folder is in the examples folder of
the UIMA SDK. Leave the remaining input fields alone and press
"Run". This will run the OpenNLP sentence detector on the UIMA SDK
sample data.
Double click on a document in the results list to bring up the Java
annotation viewer. You should see Sentence annotations (though
since the spans are contiguous, it may appear that an entire paragraph
is highlighted). Click on a Sentence annotation to see the
annotation details in the right-hand pane. When you expand the
details, you should see reasonable begin and end values.
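One way to make "reasonable begin and end values" concrete is sketched below. It assumes the spans are available as {begin, end} pairs in document order. The class SpanCheckSketch and its method are hypothetical and only illustrate the checks you would otherwise do by eye in the viewer.

```java
final class SpanCheckSketch {

    /**
     * Checks spans given as {begin, end} pairs in document order: each span must be
     * non-empty, lie inside the document, and must not overlap its predecessor.
     */
    static boolean spansLookReasonable(int documentLength, int[][] spans) {
        int previousEnd = 0;
        for (int[] span : spans) {
            int begin = span[0];
            int end = span[1];
            if (begin < previousEnd || begin >= end || end > documentLength) {
                return false;
            }
            previousEnd = end;
        }
        return true;
    }

    public static void main(String[] args) {
        String doc = "First sentence. Second sentence.";
        int[][] good = {{0, 15}, {16, 32}};
        int[][] bad = {{0, 20}, {16, 32}}; // second span starts inside the first
        System.out.println(spansLookReasonable(doc.length(), good)); // prints true
        System.out.println(spansLookReasonable(doc.length(), bad));  // prints false
    }
}
```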
Using the Example Wrappers
The OpenNLP Example Wrappers package includes source code for the
wrapper annotator classes, source code for the JCasGen-generated type
classes, and descriptor files for the analysis engines and type
system.
The source code is in "opennlp_wrappers/src", which you should now be
somewhat familiar with after following the instructions in the previous
section to compile the code. The Analysis Engine descriptors are in
"opennlp_wrappers/descriptors".
The following table summarizes the wrapper annotator classes and their
corresponding descriptor files (note that all of the wrapper annotators
are in the org.apache.uima.examples.opennlp.annotator package):
| Java Class | Descriptor File | Description |
| NEDetector.java | OpenNLPNEDetector.xml | Named entity detector (called name finder in OpenNLP) |
| Parser.java | OpenNLPParser.xml | Shallow parser |
| POSTagger.java | OpenNLPPOSTagger.xml | Part-of-speech tagger |
| SentenceDetector.java | OpenNLPSentenceDetector.xml | Sentence detector |
| Tokenizer.java | OpenNLPTokenizer.xml | Tokenizer |
The descriptors folder also contains an aggregate analysis engine
descriptor, OpenNLPAggregate.xml, which can be used to run one or more
wrapper components.
The type system descriptor, OpenNLPExampleTypes.xml, can be found in
the org.apache.uima.examples.opennlp package in the "src" folder.
The type system descriptor is located here so that the analysis engine
descriptors can import it by name.
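As a rough illustration of why the location matters: an import by name is resolved by turning the dotted name into a resource path and looking it up on the classpath (or datapath). The sketch below assumes the usual convention of replacing dots with slashes and appending ".xml". The import name is derived from the package and file name given above, and ImportByNameSketch is a hypothetical helper, not UIMA code.

```java
final class ImportByNameSketch {

    /** Converts a dotted import name into the resource path a classpath lookup would use. */
    static String toResourcePath(String importName) {
        return importName.replace('.', '/') + ".xml";
    }

    public static void main(String[] args) {
        String name = "org.apache.uima.examples.opennlp.OpenNLPExampleTypes";
        String path = toResourcePath(name);
        System.out.println(path);
        // A null URL means the descriptor is not visible from this classpath.
        java.net.URL url = ImportByNameSketch.class.getClassLoader().getResource(path);
        System.out.println(url == null ? "not found on classpath" : "found at " + url);
    }
}
```

If the lookup reports "not found", check that the wrappers "src" folder (or its compiled output) is on the classpath.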
All of the annotators use the JCas interface to the CAS, so JCasGen has
been run on the type system. All of the JCasGen-generated type
classes are in the org.apache.uima.examples.opennlp package.
OpenNLP Wrapper Type System
The OpenNLP Wrapper type system defines UIMA annotation types for the
various annotations produced by each of the OpenNLP Tools
components. You can view the type system in detail by using the
Component Descriptor Editor plug-in for Eclipse and loading the type
system descriptor.
All of the types reside in the org.apache.uima.examples.opennlp
namespace. The types are summarized in this table:
| Sentence | Spans a sentence, produced by OpenNLPSentenceDetector. |
| Token | Spans a token, produced by OpenNLPTokenizer. If OpenNLPPOSTagger has been run, the posTag field of the Token will contain the part-of-speech tag. |
| Person | Spans a Person entity, produced by OpenNLPNEDetector. |
| Organization | Spans an Organization entity, produced by OpenNLPNEDetector. |
| Time | Spans a Time entity, produced by OpenNLPNEDetector. |
| Date | Spans a Date entity, produced by OpenNLPNEDetector. |
| Location | Spans a Location entity, produced by OpenNLPNEDetector. |
| Percentage | Spans a Percentage entity, produced by OpenNLPNEDetector. |
| Money | Spans a Money entity, produced by OpenNLPNEDetector. |
| Clause | Supertype for all of the Clause annotations produced by OpenNLPParser. |
| Phrase | Supertype for all of the Phrase annotations produced by OpenNLPParser. |
OpenNLPSentenceDetector
The OpenNLPSentenceDetector detects sentence boundaries and creates
Sentence annotations that span these boundaries. The sentence
detection is performed by
opennlp.tools.lang.english.SentenceDetector.
- Inputs
- none - The analysis engine operates directly on the document in
the CAS
- Outputs
- Sentence - one Sentence annotation for each detected sentence in
the document.
- Parameters
| Name | Type | Description |
| ModelFile | String | Path to the OpenNLP model file for the English sentence detector |
OpenNLPTokenizer
The OpenNLPTokenizer tokenizes the text and creates token annotations
that span the tokens. The tokenization is performed with
opennlp.tools.lang.english.Tokenizer, which tokenizes according to the
Penn Tree Bank tokenization standard. In general, tokens are
separated by white space, but punctuation marks (e.g., ".", ",", "!", "?",
etc.) and apostrophe endings (e.g., "'s", "n't", etc.) are separate
tokens.
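To make the convention concrete, here is a deliberately simplified, rule-based sketch of Penn Treebank style splitting. It is not how OpenNLP tokenizes (the OpenNLP tokenizer uses a trained model), and the regular expression is an assumption chosen for illustration that covers only simple punctuation and the "'s" and "n't" style endings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class TreebankStyleTokenizerSketch {

    // Alternatives are tried left to right at each position:
    //  1. a word stem that is directly followed by "n't" (so "doesn't" yields "does" first)
    //  2. the ending "n't" itself
    //  3. an apostrophe ending such as "'s" or "'ll"
    //  4. a run of letters or digits
    //  5. any other single non-space character (punctuation)
    private static final Pattern TOKEN = Pattern.compile(
            "[A-Za-z]+(?=n't)|n't|'[A-Za-z]+|[A-Za-z0-9]+|\\S");

    /** Returns token spans as {begin, end} offsets into the text. */
    static List<int[]> tokenize(String text) {
        List<int[]> spans = new ArrayList<>();
        Matcher m = TOKEN.matcher(text);
        while (m.find()) {
            spans.add(new int[] {m.start(), m.end()});
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "John's dog doesn't bite, does it?";
        List<String> tokens = new ArrayList<>();
        for (int[] span : tokenize(text)) {
            tokens.add(text.substring(span[0], span[1]));
        }
        // prints: John 's dog does n't bite , does it ?
        System.out.println(String.join(" ", tokens));
    }
}
```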
- Inputs
- Sentence - The analysis engine requires Sentence annotations in
the CAS
- Outputs
- Token - one Token annotation for each detected token in the
document.
- Parameters
| Name | Type | Description |
| ModelFile | String | Path to the OpenNLP model file for the English tokenizer |
OpenNLPPOSTagger
The OpenNLPPOSTagger assigns part-of-speech tags to tokens using
opennlp.tools.lang.english.PosTagger. This annotator requires that
sentence and token annotations have been created in the CAS. The
annotator updates the posTag field of each token annotation with the
part-of-speech tag.
- Inputs
- Sentence - The analysis engine requires Sentence annotations in
the CAS
- Token - The analysis engine requires Token annotations in the CAS
- Outputs
- Token.posTag - the posTag field in each Token annotation is
updated with the part-of-speech tag for the corresponding word.
- Parameters
| Name | Type | Description |
| ModelFile | String | Path to the OpenNLP model file for the English POS tagger. Note that as of OpenNLP Tools 1.3.0, the POS tagger model file can be found in the parser model files folder. |
OpenNLPNEDetector
The OpenNLPNEDetector detects named entities in the text and creates
corresponding entity annotations that span the found entities. The
annotator uses opennlp.tools.lang.english.NameFinder, instantiating one
NameFinder for each entity class to be detected. Each entity class
has a separate MaxEnt model file. All model files must be stored in
a single model file directory and use the following naming convention:
"class.bin.gz", where "class" is the entity class name and
".bin.gz" must appear as shown, e.g., "person.bin.gz".
This analysis engine takes a parameter called "EntityTypeMappings" which maps
each entity class name to an entity annotation type. The entity
class name must match a model file in the model file directory, and the
entity annotation type must be defined in the type system and have a
corresponding JCas Java class. This allows the actual annotation
types produced by the analysis engine to be specified as a run-time
parameter.
- Inputs
- Sentence - The analysis engine requires Sentence annotations in
the CAS
- Token - The analysis engine requires Token annotations in the CAS
- Outputs
- EntityAnnotation - The analysis engine creates an EntityAnnotation
for each entity detected in the document. The actual annotation
is typically a sub-type of EntityAnnotation specialized for the
particular entity class found, e.g., Person, Organization, etc.
See the EntityTypeMappings parameter for more details.
- Parameters
| Name | Type | Description |
| ModelDirectory | String | Path to the directory that contains the OpenNLP model files for the English name finder. All model files must be stored in a single model file directory and use the following naming convention: "class.bin.gz", where "class" is the entity class name and ".bin.gz" must appear as shown, e.g., "person.bin.gz". |
| EntityTypeMappings | String Array | Mapping from entity names (obtained from the model filename) to the JCas class for the corresponding annotation. Each mapping string is of the form "name,class", i.e., the entity type name followed by a comma followed by the annotation class. |
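The two conventions above (the "name,class" mapping strings and the "class.bin.gz" model file names) can be sketched as follows. This is not the wrapper's own code: EntityMappingSketch is a hypothetical helper, the "location" entry and the models/namefind directory are assumed for illustration, and the class names simply follow the org.apache.uima.examples.opennlp package mentioned earlier.

```java
import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;

final class EntityMappingSketch {

    /** Parses "name,class" strings into an ordered map from entity name to class name. */
    static Map<String, String> parseMappings(String[] mappings) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String mapping : mappings) {
            int comma = mapping.indexOf(',');
            if (comma <= 0 || comma == mapping.length() - 1) {
                throw new IllegalArgumentException(
                        "Expected \"name,class\" but got: " + mapping);
            }
            result.put(mapping.substring(0, comma).trim(),
                    mapping.substring(comma + 1).trim());
        }
        return result;
    }

    /** Builds the model file expected for an entity name under the naming convention. */
    static File modelFileFor(File modelDirectory, String entityName) {
        return new File(modelDirectory, entityName + ".bin.gz");
    }

    public static void main(String[] args) {
        String[] mappings = {
            "person,org.apache.uima.examples.opennlp.Person",
            "location,org.apache.uima.examples.opennlp.Location" // assumed entry
        };
        File modelDirectory = new File("models/namefind"); // hypothetical directory
        for (Map.Entry<String, String> e : parseMappings(mappings).entrySet()) {
            File model = modelFileFor(modelDirectory, e.getKey());
            System.out.println(e.getKey() + " -> " + e.getValue()
                    + " (model " + model.getPath() + ", exists: " + model.isFile() + ")");
        }
    }
}
```

Checking that every mapped entity name has a model file before starting a run catches the most likely configuration mistake early.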
OpenNLPParser
The OpenNLPParser parses the document and creates phrasal and clausal
annotations over the text using
opennlp.tools.lang.english.TreebankParser.
This analysis engine
takes a parameter called "ParseTagMappings" which maps each parse tag to a
syntax annotation type. The parse tags come from the standard Penn
Tree Bank phrase and clause tags (produced by the OpenNLP parser), and
each syntax annotation type must be defined in the type system and have a
corresponding JCas Java class.
- Inputs
- Sentence - The analysis engine requires Sentence annotations in
the CAS
- Token - The analysis engine requires Token annotations in the CAS
- Outputs
- Phrase - The analysis engine creates a Phrase for each phrase tag
produced by the TreebankParser. The actual annotations created
are sub-types of Phrase, specific to the actual phrase tag. See
the ParseTagMappings parameter for more details.
- Clause - The analysis engine creates a Clause for each clause tag
produced by the TreebankParser. The actual annotations created
are sub-types of Clause, specific to the actual clause tag. See
the ParseTagMappings parameter for more details.
- Parameters
| Name | Type | Description |
| ModelDirectory | String | Path to the directory that contains the OpenNLP model files for the English parser. |
| UseTagDictionary | Boolean | Flag indicating whether or not to use the tag dictionary |
| CaseSensitiveTagDictionary | Boolean | Flag indicating whether or not the tag dictionary is case sensitive |
| BeamSize | Integer | The beam size for the parse search |
| AdvancePercentage | Float | The probability mass percentage threshold for advancing outcomes |
| ParseTagMappings | String Array | Mapping from parse result tags produced by the TreebankParser to the JCas class for the corresponding annotation. Each mapping string is of the form "tag,class", i.e., the tag name followed by a comma followed by the annotation class name. |
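As a rough guide to which tags end up under which supertype, the sketch below groups a few standard Penn Treebank bracket labels into clause-level and phrase-level sets. It is only an illustration: the actual types created by the wrapper are determined by the ParseTagMappings parameter, the phrase list here is a partial subset, and ParseTagGroupsSketch is a hypothetical helper.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class ParseTagGroupsSketch {

    // Clause-level bracket labels from the Penn Treebank tag set.
    private static final Set<String> CLAUSE_TAGS =
            new HashSet<>(Arrays.asList("S", "SBAR", "SBARQ", "SINV", "SQ"));

    // A subset of the phrase-level bracket labels; extend as needed.
    private static final Set<String> PHRASE_TAGS =
            new HashSet<>(Arrays.asList("NP", "VP", "PP", "ADJP", "ADVP", "QP", "PRT", "WHNP"));

    /** Returns "Clause", "Phrase", or null when the tag is in neither group (e.g. a POS tag). */
    static String supertypeFor(String parseTag) {
        if (CLAUSE_TAGS.contains(parseTag)) {
            return "Clause";
        }
        if (PHRASE_TAGS.contains(parseTag)) {
            return "Phrase";
        }
        return null;
    }

    public static void main(String[] args) {
        for (String tag : new String[] {"S", "NP", "VP", "SBAR", "NN"}) {
            System.out.println(tag + " -> " + supertypeFor(tag));
        }
    }
}
```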
Tips and Traps
- The OpenNLP Tools can require a lot of Java heap memory, especially
if you run multiple annotators simultaneously. You'll likely want
to increase your maximum heap size with the -Xmx command line
argument to the JVM, for example -Xmx1024M. If you
are using an Eclipse run configuration for the UIMA SDK tools (Document
Analyzer and CPE Configurator), you can specify this VM argument on the
"Arguments" tab of the run configuration.
- The jar files that come with the OpenNLP Tools package may have been
compiled with Java 1.5. Although you can compile the UIMA wrappers
with Java 1.4, if you try to run your UIMA application (e.g., the
Document Analyzer) with Java 1.4 and you get a
"java.lang.UnsupportedClassVersionError: ... (Unsupported major.minor
version 49.0)", try running your application with Java 1.5.
- To train new models for the OpenNLP components, see the README file
distributed with the OpenNLP Tools package.
- Note that OpenNLPTokenizer requires Sentence annotations, and
OpenNLPPOSTagger, OpenNLPNEDetector, and OpenNLPParser require Sentence
and Token annotations, so in most cases you will be running an aggregate
that minimally includes OpenNLPSentenceDetector and OpenNLPTokenizer.
- The models for the OpenNLP name finder and parser were created using
a tokenization produced by the OpenNLP tokenizer. If you use a
different sentence detector and tokenizer that produce a tokenization
different from the Penn Tree Bank standard, you may not get the best
possible performance from the name finder and parser.
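The input requirements mentioned in the tips above can be checked mechanically before you assemble an aggregate. The following sketch encodes the Inputs and Outputs listed for each annotator in this Readme and reports the first annotator whose inputs would be missing. PipelineOrderSketch is a hypothetical helper, not part of the wrappers or of UIMA.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class PipelineOrderSketch {

    // Annotation types each annotator requires, as listed under "Inputs" in this Readme.
    private static final Map<String, List<String>> INPUTS = new LinkedHashMap<>();
    // Annotation types each annotator adds, as listed under "Outputs" in this Readme.
    private static final Map<String, List<String>> OUTPUTS = new LinkedHashMap<>();

    static {
        register("OpenNLPSentenceDetector",
                Collections.<String>emptyList(), Arrays.asList("Sentence"));
        register("OpenNLPTokenizer",
                Arrays.asList("Sentence"), Arrays.asList("Token"));
        // The POS tagger only fills in Token.posTag, so it adds no new annotation type.
        register("OpenNLPPOSTagger",
                Arrays.asList("Sentence", "Token"), Collections.<String>emptyList());
        register("OpenNLPNEDetector",
                Arrays.asList("Sentence", "Token"), Arrays.asList("EntityAnnotation"));
        register("OpenNLPParser",
                Arrays.asList("Sentence", "Token"), Arrays.asList("Phrase", "Clause"));
    }

    private static void register(String name, List<String> in, List<String> out) {
        INPUTS.put(name, in);
        OUTPUTS.put(name, out);
    }

    /** Returns null if every annotator finds its inputs, else a description of the first gap. */
    static String firstProblem(List<String> pipeline) {
        Set<String> available = new HashSet<>();
        for (String annotator : pipeline) {
            List<String> required = INPUTS.get(annotator);
            if (required == null) {
                throw new IllegalArgumentException("Unknown annotator: " + annotator);
            }
            for (String type : required) {
                if (!available.contains(type)) {
                    return annotator + " requires " + type;
                }
            }
            available.addAll(OUTPUTS.get(annotator));
        }
        return null;
    }

    public static void main(String[] args) {
        List<String> ok = Arrays.asList(
                "OpenNLPSentenceDetector", "OpenNLPTokenizer", "OpenNLPPOSTagger");
        List<String> broken = Arrays.asList("OpenNLPTokenizer", "OpenNLPParser");
        System.out.println(firstProblem(ok));     // prints null
        System.out.println(firstProblem(broken)); // prints OpenNLPTokenizer requires Sentence
    }
}
```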