xBio:D Roadmap

The Database

The core of the database is the information found on the specimen labels: the place of collection, the time of collection, who did the collecting, how the specimens were collected, and the identification of the specimen. All of this is linked together by the unique identifier assigned to each specimen (the collecting unit ID). That ID in turn links to information on where the specimen is deposited and to any images (or other media) of the specimen. The database also stores information on the published literature.
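To make the linkage concrete, here is a minimal sketch in Python of a specimen record built around the collecting unit ID; the class and field names are hypothetical, not the actual xBio:D schema.

  # Hypothetical sketch, not the real xBio:D schema: one record per specimen,
  # keyed by its collecting unit ID and linking out to the other data.
  from dataclasses import dataclass, field
  from typing import List, Optional

  @dataclass
  class CollectingEvent:
      locality: str        # place of collection
      date: str            # time of collection
      collector: str       # who did the collecting
      method: str          # how the specimen was collected

  @dataclass
  class CollectingUnit:
      collecting_unit_id: str                    # the unique specimen identifier
      event: CollectingEvent
      determination: Optional[str] = None        # identification of the specimen
      depository: Optional[str] = None           # where the specimen is deposited
      media: List[str] = field(default_factory=list)  # pointers to images/other media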


IPT

The data in the database are publicly available through our own portals. Additionally, the data are intended to be regularly harvested and cached by data aggregators, including the Global Biodiversity Information Facility (GBIF), iDigBio, and the SCAN network. These aggregators do this by connecting to resources made available with the Integrated Publishing Toolkit (IPT), a Java program produced by GBIF. The scheme is that the database, at weekly intervals, produces a Darwin Core (DwC) file that contains the information we are sharing. Each resource we make available has a separate DwC file. We have, or intend to have, a couple dozen such resources.
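A minimal sketch of the weekly export step is shown below, assuming the records have already been pulled from the database; the column subset and the tab-delimited layout are illustrative, while the real file is produced by the database and served through the IPT.

  # Sketch only: write one resource's records as a tab-delimited Darwin Core
  # occurrence file. The column subset chosen here is illustrative.
  import csv

  DWC_COLUMNS = ["occurrenceID", "scientificName", "country",
                 "decimalLatitude", "decimalLongitude", "eventDate", "recordedBy"]

  def write_dwc_file(records, path="occurrence.txt"):
      with open(path, "w", newline="", encoding="utf-8") as fh:
          writer = csv.DictWriter(fh, fieldnames=DWC_COLUMNS, delimiter="\t")
          writer.writeheader()
          for rec in records:                      # rec: dict of DwC term -> value
              writer.writerow({col: rec.get(col, "") for col in DWC_COLUMNS})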


Specimage

First, this app is intended to be pronounced "spess-ee-maj", a mashup of the words "specimen" and "image." Fundamentally, this is simply an image management system. It differs from similar commercially available programs in that the specimen seen in each image is linked to its collecting unit ID. This ID then provides access to all of the information in the core database that is associated with that specimen. Specimage also has an upload function to add new images. During that process a thumbnail and a web-friendly JPG version of the original image are produced, and the user specifies the license under which the image may be distributed. The core database contains only pointers to the location of the actual images.
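The derivative-image step of an upload might look roughly like the following, assuming the Pillow library; the sizes, file naming, and quality settings are illustrative rather than Specimage's actual values.

  # Sketch, assuming Pillow: produce the web-friendly JPG and the thumbnail
  # from an uploaded original. Sizes and naming are illustrative.
  from PIL import Image

  def make_derivatives(original_path, thumb_size=(160, 160), web_max=1024):
      img = Image.open(original_path).convert("RGB")
      stem = original_path.rsplit(".", 1)[0]

      web = img.copy()
      web.thumbnail((web_max, web_max))            # scale longest side to web_max
      web.save(stem + "_web.jpg", "JPEG", quality=85)

      thumb = img.copy()
      thumb.thumbnail(thumb_size)
      thumb.save(stem + "_thumb.jpg", "JPEG", quality=80)

      # The core database stores only pointers like these, plus the license.
      return {"web": stem + "_web.jpg", "thumbnail": stem + "_thumb.jpg"}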


HOL

HOL - Hymenoptera On Line - is intended as a generic portal to the data we have. Text entered into the search box is interpreted in as many ways as possible: as a specimen ID, as the name of an organism, as a place name, as a person's name, etc. The results from these various interpretations are presented as a series of tabs. Within each tab are expandable sections for predefined categories of information. Wildcards (% and _) are accepted in the text input. Most of the information is live, i.e., directly extracted from the database and therefore as current as possible. Some summary information, however, is collated weekly and so may be slightly out of date. HOL also takes the lat/long coordinates for the places where specimens have been collected and uses Google Maps libraries to produce a map of those localities.
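As a toy illustration of the "interpret the text in as many ways as possible" idea, the sketch below fans a single query string out into candidate interpretations, one per results tab; the categories and the specimen-ID pattern are assumptions, not HOL's actual logic.

  # Toy sketch, not HOL's code: fan one search string out into candidate
  # interpretations, each of which would populate its own results tab.
  import re

  def interpret_query(text):
      interpretations = []
      if re.fullmatch(r"[A-Za-z]+[ _]?\d+", text):   # looks like a specimen ID
          interpretations.append(("specimen_id", text))
      # % and _ wildcards pass straight through to SQL LIKE comparisons
      interpretations.append(("taxon_name", text))
      interpretations.append(("place_name", text))
      interpretations.append(("person_name", text))
      return interpretations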

OJ_Break API

OJ_Break is the name of the API used to interact with many (ultimately all!) of the web-based data portals. The output of OJ_Break is a set of JSON objects that can then be parsed and formatted for display by JavaScript code in the webpage.
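On the client side, consuming the API amounts to fetching and parsing JSON. The sketch below shows that pattern in Python; the base URL, method name, and parameters are placeholders, not the documented OJ_Break interface.

  # Placeholder client sketch: the URL, method name, and parameters here are
  # hypothetical, not the documented OJ_Break calls.
  import json
  import urllib.parse
  import urllib.request

  def call_api(base_url, method, **params):
      query = urllib.parse.urlencode(params)
      with urllib.request.urlopen(f"{base_url}/{method}?{query}") as resp:
          return json.loads(resp.read().decode("utf-8"))

  # e.g. data = call_api("https://example.osu.edu/api", "getTaxonInfo", name="Scelionidae")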

bioguid.osu.edu

The Biodiversity Information Standards group (TDWG) has set up, and presumably still maintains, a vocabulary for the basic kinds of information that we share. In some of the data portals (e.g., HNS) we have the option of delivering the information in RDF. The domain bioguid.osu.edu is intended as a resolution mechanism for the (hopefully) globally unique identifiers that we support. In its original formulation TDWG recommended the use of Life Science Identifiers (LSIDs) as the format for these identifiers. The community has now generally abandoned that format and instead opted for stable URLs. The resolver software should be able to handle both formats. Unfortunately, bioguid.osu.edu presently seems to be offline.
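The resolution step itself is mostly identifier parsing. The sketch below accepts either form and extracts the pieces needed to look the object up; the parsing rules and the example identifier are assumptions, not the behavior of the original resolver.

  # Assumed parsing rules for illustration only, not the original resolver.
  def parse_identifier(identifier):
      """Return (namespace, object_id) from an LSID or a stable-URL identifier."""
      if identifier.lower().startswith("urn:lsid:"):
          # LSID form: urn:lsid:<authority>:<namespace>:<objectID>[:<revision>]
          parts = identifier.split(":")
          return parts[3], parts[4]
      # Stable-URL form: assume .../<namespace>/<objectID>
      segments = identifier.rstrip("/").split("/")
      return segments[-2], segments[-1]

  # parse_identifier("urn:lsid:example.org:occurrences:OSUC0123456")
  #   -> ("occurrences", "OSUC0123456")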

HNS

HNS is the Hymenoptera Name Server. Its function is to provide basic information associated with a taxonomic name. The code for this portal is actually compiled and stored within the database itself. It does not make use of the OJ_Break API.

osuc-mgr

The osuc-mgr (database manager) is a set of forms used to enter or edit information within the database. It is protected by a username/password combination, and roles for different users are specified. It uses the OJ_Break API. This app is designed for entering/editing individual pieces of information. For batch input of specimen information see the description of DEA below. Information in the database may be edited in osuc-mgr, but not deleted outright.
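One simple way to honor the "edit but never delete outright" rule is to retire records with a status flag rather than remove rows; whether osuc-mgr actually works this way internally is an assumption, and the table and column names below are hypothetical.

  # Assumed soft-delete pattern; table and column names are hypothetical.
  def retire_record(cursor, record_id):
      cursor.execute(
          "UPDATE specimen_attribute SET valid = 0 WHERE id = :id",
          {"id": record_id},                 # row is flagged invalid, never removed
      )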

DEA

DEA, the Digital Entry Assistant, is intended as a means of batch upload of specimen collecting data. Users first transcribe specimen data into an Excel spreadsheet (a template is provided). Each row in the spreadsheet is a different specimen, and the columns are the attributes to be associated with that specimen. This spreadsheet is then used as the input to DEA. DEA is a set of Python scripts written using the Django framework. Internally, a MySQL database stores information about data set uploads, their status, etc. DEA takes the input, parses it, and then checks whether the individual pieces of attribute data (such as the country in which a specimen was collected) are already in the database. If not, the user is prompted to enter those missing pieces (via osuc-mgr). When all instances of missing data are resolved, DEA manages the upload and proper storage of all of the information.
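A simplified sketch of DEA's first pass over the spreadsheet is shown below, assuming the openpyxl library; the template layout, the column headers, and the existing-value lookup are all illustrative.

  # Simplified sketch of DEA's checking pass, assuming openpyxl; the template
  # layout and attribute names are illustrative.
  from openpyxl import load_workbook

  def find_missing_values(xlsx_path, known_values):
      """Collect spreadsheet values not yet present in the database.

      known_values maps an attribute (column header) to the set of values the
      database already holds, e.g. {"Country": {"Madagascar", "Chile"}}.
      """
      ws = load_workbook(xlsx_path, read_only=True).active
      rows = ws.iter_rows(values_only=True)
      headers = next(rows)                      # first row: attribute names
      missing = {}
      for row in rows:                          # each remaining row is one specimen
          for attr, value in zip(headers, row):
              if value and attr in known_values and value not in known_values[attr]:
                  missing.setdefault(attr, set()).add(value)
      return missing                            # resolved by the user via osuc-mgr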

vSysLab

vSysLab is the virtual Systematics Laboratory. At its core it stores information within the database to produce a matrix of species (the rows of the matrix) and their characteristics or attributes (the columns of the matrix). The key feature that distinguishes it from similar programs is that the names of the species provide links to the other information in the database, such as all of the specimens bearing that name and all of the places where those specimens have been collected. vSysLab allows the user to link the textual descriptions of the attributes to images (via Specimage). It provides tools to manage species and characters, as well as a variety of output formats.
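At its simplest, the underlying structure is a sparse species-by-character matrix; the toy class below illustrates that shape and is not vSysLab's internal representation.

  # Toy species-by-character matrix; not vSysLab's internal representation.
  class CharacterMatrix:
      def __init__(self):
          self.characters = []                  # column order
          self.scores = {}                      # species name -> {character: state}

      def score(self, species, character, state):
          if character not in self.characters:
              self.characters.append(character)
          self.scores.setdefault(species, {})[character] = state

      def rows(self):
          """Yield one row per species, states in character (column) order."""
          for species, states in self.scores.items():
              yield [species] + [states.get(ch, "?") for ch in self.characters]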

Literature Library

We are shifting from maintaining a hard-copy library of publications to one of PDFs only. We currently have over 10,000 publications in the library. The database stores pointers to the actual storage locations of the PDFs. PDFs are uploaded through an osuc-mgr form.

Hardware Layout

With our ongoing shift to a virtual environment, most of this description is obsolete. However, it may help to understand the original crosstalk between computers. We eventually ended up with three computers running the system: one running Linux (128.146.250.24), an old Windows desktop (128.146.250.117), and a newer Windows Server (128.146.250.252). These had multiple domain names associated with them, just to help make it all more confusing:

252:

  • osuc.osu.edu: Collection's web server
  • specimage.osu.edu: Specimage, both the code as well as the images themselves
  • wasps.osu.edu: Johnson lab web server
  • osuc.osu.edu/DEA2: DEA

24:

  • osuc.biosci.ohio-state.edu: the Oracle database
  • osuc-mgr.osu.edu: the osuc-mgr set of forms
  • hns.osu.edu: Hymenoptera Name Server (HNS)
  • hol.osu.edu: Hymenoptera On Line (HOL)
  • vsyslab.osu.edu: vSysLab
  • bioguid.osu.edu: Globally unique identifier resolving service (not absolutely sure that it was here. We need to find and restore it!)

117:

  • hymfiles.biosci.ohio-state.edu: PDF libraries
  • xbiod.osu.edu: IPT (in ./ipt), xBio:D wiki (in ./xbiodWiki)

I don't see any communication going from 117 to 24. However, all other combinations seem to be there: 24 -> 252, 24 -> 117, and 252 -> 24.