ENVIRONMENTS: a standalone command line application capable of identifying environment descriptive terms, such as "coral reef, cultivated land, glacier, pelagic, forest, lagoon", in text.

The Environment Ontology (EnvO), a community resource offering a controlled, structured vocabulary for biomes, environmental features, and environmental materials, serves as the source of names and synonyms for this identification process.

Given a folder of plain text files, ENVIRONMENTS uses its dictionary of names and synonyms to report each detected environment descriptive term, its start and end position in each document, and the corresponding Environment Ontology identifier.
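The kind of output described above can be illustrated with a minimal Python sketch. This is not the ENVIRONMENTS implementation: the function names `tag_text` and `tag_folder`, the tiny dictionary, and the placeholder identifiers are all assumptions made for illustration, and end positions here follow Python's exclusive-end convention, which may differ from the tool's own.

```python
import re
from pathlib import Path

# Hypothetical mini-dictionary: lowercase surface form -> identifier.
# The identifiers are placeholders, NOT real EnvO accessions.
DICTIONARY = {
    "coral reef": "ENVO:PLACEHOLDER_1",
    "reef": "ENVO:PLACEHOLDER_2",
    "glacier": "ENVO:PLACEHOLDER_3",
    "lagoon": "ENVO:PLACEHOLDER_4",
}


def tag_text(text, dictionary):
    """Return (term, start, end, identifier) tuples found in text.

    start is inclusive and end is exclusive, so text[start:end] == term.
    """
    if not dictionary:
        return []
    # Longest names first, so "coral reef" is preferred over "reef".
    names = sorted(dictionary, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(name) for name in names) + r")\b",
        re.IGNORECASE,
    )
    return [
        (m.group(0), m.start(), m.end(), dictionary[m.group(0).lower()])
        for m in pattern.finditer(text)
    ]


def tag_folder(folder, dictionary):
    """Yield (file name, term, start, end, identifier) for each .txt file."""
    for path in sorted(Path(folder).glob("*.txt")):
        text = path.read_text(encoding="utf-8")
        for term, start, end, identifier in tag_text(text, dictionary):
            yield path.name, term, start, end, identifier
```

Calling `tag_text("The coral reef borders a lagoon.", DICTIONARY)` reports "coral reef" once rather than also reporting the nested "reef", because longer dictionary entries are tried first.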

To improve detection, ENVIRONMENTS allows for orthographic variation in the way terms are written (e.g. plural forms and spacing/hyphenation, as in "freshwater", "fresh-water", and "fresh water").
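One way to tolerate such variation is sketched below in Python. It is an illustration only and does not reflect how ENVIRONMENTS handles variants internally: the helper name `variant_regex` is hypothetical, the sketch assumes the dictionary entry is stored in its multi-word form (e.g. "fresh water"), and the plural rule is a naive optional "s" or "es" that ignores irregular plurals.

```python
import re


def variant_regex(term):
    """Compile a pattern for spacing, hyphenation, and simple plural variants.

    Assumes `term` is given in multi-word form, e.g. "fresh water".
    """
    # Split the dictionary entry on existing spaces or hyphens.
    parts = re.split(r"[\s-]+", term.strip())
    # Between parts, accept a single space, a hyphen, or nothing at all.
    body = r"[\s-]?".join(re.escape(part) for part in parts)
    # Naive plural handling: an optional trailing "es" or "s".
    return re.compile(r"\b" + body + r"(?:es|s)?\b", re.IGNORECASE)


if __name__ == "__main__":
    pattern = variant_regex("fresh water")
    for sample in ("freshwater", "fresh-water", "fresh water", "fresh waters"):
        print(sample, bool(pattern.fullmatch(sample)))
```

A single pattern per dictionary entry keeps the dictionary small, at the cost of occasionally accepting forms that are not real words; a stopword list, such as the one distributed with the software, is one way to filter unwanted matches.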


Image source: http://greens.org.au (left), Flickr by Nick-K (center), Wikipedia by Onderwijsgek (right)

Described in: ENVIRONMENTS and EOL: identification of Environment Ontology terms in text and the annotation of the Encyclopedia of Life. Pafilis E, Frankild SP, Schnetzer J, et al. (2015) Bioinformatics (btv045)

Biological Scenarios

  • The ENVIRONMENTS-EOL Project

    ENVIRONMENTS and EOL: From Plain Text to Enriched Encyclopedia of Life (EOL) Contents
    A project aiming to process the EOL Taxon pages and extract descriptions of their environmental context. The extracted descriptions will subsequently be employed to answer integrative, large-scale biological questions (funded by the EOL Rubenstein Fellows Program).

  • SEQenv: From Signals to Environmentally Tagged Sequences
    A pipeline capable of annotating genetic sequences based on environment descriptive terms occurring within their records and/or in relevant literature.

Availability: the ENVIRONMENTS software (under BSD license) along with its dictionary and its stopword list (both under CC-BY license) is available here. The gold standard corpus used for the ENVIRONMENTS evaluation can be downloaded from here. The corpus is based on Encyclopedia of Life Taxon pages and is distributed under the CC-BY-NC-SA license.

Sister Project: SPECIES, a command line application capable of identifying taxonomic mentions in documents.

Team: Evangelos Pafilis#, Sune Frankild*, Lucia Fanini, Sarah Faulwetter, Christina Pavloudi, Katerina Vasileiadou, Christos Arvanitidis, Lars Juhl Jensen*# (*: main software developers, #: correspondence)

Maintained: at the Novo Nordisk Foundation Center for Protein Research (NNFCPR), Denmark, and the Institute of Marine Biology, Biotechnology and Aquaculture (IMBBC), Hellenic Centre for Marine Research (HCMR), Crete, Greece