r. alexander milowski, geek

Welcome to the web home of Alex Milowski. Here you'll find information about me, some of the software I've written, and the projects in which I participate. You'll find a variety of mathematics, technology, papers, and presentations on this website written or contributed to by me. If you have any questions or comments, please don't hesitate to contact me.

Recent Entries

Unified Content Descriptors

The International Virtual Observatory Alliance (IVOA) is an organization that helps set the technology standards used by astronomers on the Web to exchange information.  One interesting aspect of astronomical data is that how the data was collected is as important as the particular measurements or images of specific targets.  As such, when information is exchanged, the semantics of what particular columns of data actually mean and how they relate to each other (e.g. an error estimate for another column) are very important.

The promise of Semantic Web technologies has been that we can encode these semantics, but encoding all the specifics into URIs, especially when some future combinations are unknown, is a daunting task.  What are the standards, conventions, and naming idioms that we will use to accomplish that?  How will we entice people to use these URIs, and will they be convenient enough that people won't develop their own?

The IVOA has an interesting take on this coding problem called Unified Content Descriptors (UCD), which has several notable features.  First, there is always a base value: are we measuring photometric magnitude, statistical error, angular momentum, the time observed, or is this just a value like a URI?

Second, semantics are conveyed by a combination of words: photometric magnitude + statistical error = the statistical error for the measured photometric magnitude.  In general, a UCD is a sequence of words that are used to build up a concept.

Third, words are built from atoms that are universally accepted.  Thus, when words are created, em always means Electromagnetic Spectrum and dec is always declination.  This brings some consistency to the construction of words.

UCD Syntax

The rules are simple:

  1. Atoms are short, simple tokens; hyphens are discouraged, and periods and semicolons are avoided.
  2. Words are formed from atoms that are separated by periods.
  3. Words start with the most general and proceed left-to-right to the more specific.
  4. A UCD is a sequence of words separated by semicolons.
  5. A UCD always starts with a primary word that represents the base value (e.g. a photometric magnitude reading or a statistical error quantity).

For example, the measurement of magnitude in the J band is the UCD photo.mag;em.IR.J while the statistical error for that measurement is stat.error;photo.mag;em.IR.J.
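The syntax rules above are simple enough to sketch in a few lines. The following Python is only an illustration, not anything published by the IVOA: the names UCD, parse_ucd, and format_ucd are made up here, and the sketch checks only the separators, not whether the atoms and words appear in the IVOA's agreed list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UCD:
    """A parsed UCD (hypothetical type): a primary word plus secondary words."""
    primary: tuple     # atoms of the primary word, most general first
    secondary: tuple   # one tuple of atoms for each remaining word


def parse_ucd(text: str) -> UCD:
    """Split a UCD on semicolons into words and each word on periods into atoms."""
    words = [tuple(word.split(".")) for word in text.strip().split(";")]
    # An empty atom means a stray separator, e.g. "photo..mag" or a trailing ";".
    if any(atom == "" for word in words for atom in word):
        raise ValueError(f"empty word or atom in UCD: {text!r}")
    return UCD(primary=words[0], secondary=tuple(words[1:]))


def format_ucd(ucd: UCD) -> str:
    """Rebuild the textual form: atoms joined by periods, words by semicolons."""
    return ";".join(".".join(word) for word in (ucd.primary, *ucd.secondary))


error = parse_ucd("stat.error;photo.mag;em.IR.J")
print(error.primary)     # the base value: ('stat', 'error')
print(error.secondary)   # the words that qualify it
```

Keeping the primary word separate from the rest mirrors rule 5: the first word says what kind of value this is, and everything after it narrows that down.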

Of course, this works for the IVOA community because they decide on the atoms and words in advance and they have a process for continuously modifying that list.  You can see the current list online and they have a wiki that describes the whole process.

Application to RDFa

What I want to consider is how this useful and successful concept of a UCD can be used by RDFa vocabularies.  Specifically, how do we embed these into URIs and still retain the flexibility they represent for recombining known words into new content descriptions.

Fortunately, according to RFC 2396, Uniform Resource Identifiers (URI): Generic Syntax, §3.3 Path Component, UCDs are completely valid in a path segment.  The primary word becomes the segment and each subsequent word becomes a parameter.  That works quite well, in my view, because the subsequent words really are parameters to the base value represented by the primary word.

That means the IVOA could easily encode each UCD into a URI just by prefixing with a known URI:

http://www.ivoa.net/ucd/photo.mag;em.IR.J
http://www.ivoa.net/ucd/stat.error;photo.mag;em.IR.J
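
As a rough check of the path-parameter idea, Python's standard urllib.parse.urlparse already separates the parameters of the last path segment from the segment itself. The sketch below assumes the base URI used in the examples above (it is not an official IVOA endpoint), and the helper names ucd_to_uri and uri_to_ucd are invented for illustration.

```python
from urllib.parse import urlparse

# Assumed base URI, copied from the examples above; not an official endpoint.
UCD_BASE = "http://www.ivoa.net/ucd/"


def ucd_to_uri(ucd: str, base: str = UCD_BASE) -> str:
    """Prefix a UCD with the base URI.

    Assumes the UCD contains only atoms, periods, and semicolons, so that
    no escaping is needed inside the path segment.
    """
    return base + ucd


def uri_to_ucd(uri: str, base: str = UCD_BASE) -> str:
    """Recover the UCD by re-joining the last path segment and its parameters."""
    if not uri.startswith(base):
        raise ValueError(f"not under {base}: {uri!r}")
    parts = urlparse(uri)
    primary = parts.path.rsplit("/", 1)[-1]
    return primary + (";" + parts.params if parts.params else "")


uri = ucd_to_uri("stat.error;photo.mag;em.IR.J")
parts = urlparse(uri)
print(parts.path)     # the primary word sits in the path: /ucd/stat.error
print(parts.params)   # the remaining words arrive as parameters
```

The split falls exactly where the UCD rules put it: the primary word is the segment, and the remaining words are its parameters.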

From an RDFa usage perspective, I want to take this into other application domains.  So, my next question is essentially: how would this idea change or solve problems in existing vocabularies like schema.org?

MarkLogic World 2013: mesonet.info: A Large-scale Weather Database for Citizen Science

This presentation discusses an architecture designed to receive weather reports and provide them on the Web, in real-time, using MarkLogic. The raw data comes from The Citizen Weather Observation Program which generates about 55K weather reports an hour from around the world but is not available on the Web. The architecture uses XProc (XML Pipelines) to both import and serve the data while MarkLogic scales the geospatial database. The trials, failures, recoveries, and eventual solution for a system that can scale while also providing real-time access for slicing and dicing by spatial and time dimensions will be presented. Several demonstrations of visualization and computation of scientific results will be shown.

What is the Subject Origin?

RDFa allows annotations of subjects (identifiers) to exist in multiple locations within a document. When a user tries to retrieve elements by this subject identifier, which element is returned? Currently, the RDFa API says that all the element origins in the document identified via @about, @resource, @src, or @href are returned by the document.getElementsBySubject() API method.

For example, consider this document using RDFa:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title></title>
    </head>
    <body vocab="http://www.example.com/">
       <div about="_:ex1">
          <span property="a">v1</span>
       </div>
       <div about="_:ex1">
          <span property="b">v2</span>
       </div>
       <div resource="_:ex1">
          <span property="c">v3</span>
       </div>
       <div about="_:ex1" typeof="T">
          <span property="d">v4</span>
       </div>
       <div resource="_:ex1" typeof="T">
          <span property="e">v5</span>
       </div>
    </body>
</html>       
    

This example generates the triples:

<origin.xhtml>	<http://www.w3.org/ns/rdfa#usesVocabulary>	<http://www.example.com/>
@prefix e: <http://www.example.com/> .
<_:ex1> rdf:type <http://www.example.com/T> ;
        e:a	"v1" ;
        e:b	"v2" ;
        e:c	"v3" ;
        e:d	"v4" ;
        e:e	"v5" .

What is the subject origin?

  • Five div elements use the subject _:ex1.
  • Each child span element generates a different property.
  • Two separate elements type the subject as http://www.example.com/T.

With the current RDFa API, all five of these div elements should be returned by document.getElementsBySubject("_:ex1") and two by document.getElementsByType("http://www.example.com/T"). Also, each property is generated by a different set of descendants of each subject element origin.
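
To make the counts concrete, here is a deliberately simplified model in Python. It is not the RDFa API and not a conforming RDFa processor: it looks only at literal @about, @resource, and @typeof values, resolves types against the @vocab on the root element, and ignores @href, @src, and subject inheritance. The function names are invented for this sketch.

```python
import xml.etree.ElementTree as ET

# The body of the example document, without the XML declaration, DOCTYPE,
# and XHTML namespace so that element and attribute lookups stay simple.
DOC = """
<body vocab="http://www.example.com/">
  <div about="_:ex1"><span property="a">v1</span></div>
  <div about="_:ex1"><span property="b">v2</span></div>
  <div resource="_:ex1"><span property="c">v3</span></div>
  <div about="_:ex1" typeof="T"><span property="d">v4</span></div>
  <div resource="_:ex1" typeof="T"><span property="e">v5</span></div>
</body>
"""


def elements_by_subject(root, subject):
    """Return every element whose @about or @resource names the subject."""
    return [el for el in root.iter()
            if subject in (el.get("about"), el.get("resource"))]


def elements_by_type(root, type_uri):
    """Return every element whose @typeof, resolved against @vocab, is type_uri."""
    vocab = root.get("vocab", "")
    return [el for el in root.iter()
            if el.get("typeof") is not None
            and vocab + el.get("typeof") == type_uri]


root = ET.fromstring(DOC.strip())
origins = elements_by_subject(root, "_:ex1")
typed = elements_by_type(root, "http://www.example.com/T")
# Properties authored beneath each origin, in document order.
authored = [[span.get("property") for span in el.iter("span")] for el in origins]
print(len(origins), len(typed), authored)
```

Even in this toy form, the caller gets a list back for a single subject, and each origin carries a different slice of that subject's properties.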

Obviously, this particular example is quite pathological. That said, the ability to have subject annotations in different places within a single document is a good feature that is useful when the content doesn't follow a tree structure. As such, in practice, something like this will happen for good reasons.

In contrast, it probably isn't a good idea to type a subject in different locations. The resulting annotation graph (i.e. the triples) is the same, so it just isn't necessary: the same thing is being said twice. Alas, it is possible and is likely to happen in some document somewhere on the Web.

What does this mean for an easier API?

There are some hard bits here in that getting subject and typed element origins always needs to return an array of elements. This means that for simple annotations where there is only one subject/typed element origin, the API has a cardinality mismatch. I'm not sure what the right answer is but always having to de-reference an array is unfortunate.

Further, if we want to have an RDFa API object accessible on the element origin, as Green Turtle does in a limited way, then this object/element pair needs to have three properties:

  1. The element must be the origin of a single subject.
  2. The same RDFa API object must be accessible from all subject origins.
  3. The same subject properties (i.e. subset of the annotation graph) must be available.

I believe that (1) is satisfied by the way subjects are generated from the RDFa attributes in section 7.5 of the RDFa 1.1 Core specification. As such, the same object can be presented via the API to the consumer for that particular subject. This also helps satisfy (2).

The third part is essentially acknowledging that this object is a jumping off point where a consumer is likely to access any number of properties of the subject. As such, you should get all the properties regardless of whether they are actually specified on that particular element or its descendants. That is, from a usability perspective, it doesn't make sense to restrict it to those derived from the descendants.

Further refining this, generating the subset of properties only exhibited by that element is more computationally expensive. The regular element DOM will tell you the authored properties if you just look at the descendants' use of the RDFa attributes. As such, authoring tools can determine this in better ways.

What next?

My proposal is that every subject origin have a data property that is the RDFa API object. The id property of this object returns the subject URI. Further methods on this object should allow access to the subject properties (i.e. subset of the annotation graph).
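
One way to picture the proposal, independent of the DOM, is a single shared object per subject that every origin points to. The Python below is only a language-neutral model of that idea: SubjectData, build_subject_index, and the properties and values methods are hypothetical names, and a real implementation would live in JavaScript on the element itself.

```python
from collections import defaultdict


class SubjectData:
    """Hypothetical stand-in for the proposed per-subject RDFa API object."""

    def __init__(self, subject):
        self.id = subject                  # the subject URI or blank node label
        self._values = defaultdict(list)   # predicate -> list of object values

    def add(self, predicate, value):
        self._values[predicate].append(value)

    def properties(self):
        """All predicates known for the subject, wherever they were authored."""
        return sorted(self._values)

    def values(self, predicate):
        return list(self._values.get(predicate, []))


def build_subject_index(triples):
    """Create exactly one SubjectData per subject from (s, p, o) triples."""
    index = {}
    for s, p, o in triples:
        data = index.get(s)
        if data is None:
            data = index[s] = SubjectData(s)
        data.add(p, o)
    return index


E = "http://www.example.com/"
triples = [("_:ex1", E + name, value)
           for name, value in zip("abcde", ["v1", "v2", "v3", "v4", "v5"])]
index = build_subject_index(triples)

# Every origin of _:ex1 would be handed the same object (property 2 above),
# and that object exposes all five properties (property 3 above).
first_origin_data = index["_:ex1"]
last_origin_data = index["_:ex1"]
print(first_origin_data is last_origin_data, first_origin_data.properties())
```

Because the object is keyed by subject rather than by element, no per-element subset of the graph needs to be computed, which matches the cost argument made earlier.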

The next problem is how to make accessing properties and values easier for scripting.
