r. alexander miłowski, geek

Do Elements have URIs?

I was discussing a problem with a learned colleague of mine, whose opinions I value greatly: the triples generated from RDFa and the in-browser applications I have developed using Green Turtle. In short, I wanted to duplicate the kind of processing I do in the browser so that I can run it through XProc and do more complicated processing of the documents. Yet I rely on the origin of triples in the document for my application to work.

His response was "just generate a URI", and I pushed back a bit. I don't think of the origin of a triple as something that is easily named with a URI, and I need to explain why I believe that.

Why do I care? Because I do things with RDFa in the browser (a simple example and a complicated one) and sometimes I want to do the same thing outside of the browser; other tools are failing me right now.

A Bit of History

Some of you might remember the XPointer Framework, which provided a mechanism for embedding a pointer into the fragment identifier. In theory, you can point to specific elements by using tumblers (e.g., /1/2 is the second child of the first element) or by a pointer (i.e., an XPath expression), though you may need to deal with the complexity of whatever namespaces your document uses. The result is something that might not be so easy to parse, cut-n-paste, or otherwise manipulate, but it should work.
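
For example, these two made-up pointers both target the second child element of a hypothetical document's root; the first uses the element() scheme with a child sequence, and the second shows the namespace overhead of the xmlns() and xpointer() schemes (the URI and element names are invented):

http://example.com/doc.xml#element(/1/2)
http://example.com/doc.xml#xmlns(d=http://example.com/ns)xpointer(/d:doc/d:section[2])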

Yet we really don't have XPointer functionality in browsers, except possibly some minimal form necessary for SVG's use of XLink. Some of that might have to do with the complexity involved and diminishing returns. That is, people have gotten along with naked fragment identifiers and the id attribute for quite a while. Others have usurped the fragment portion of the URI for other nefarious purposes (e.g., application state).

Nothing is Free

In the browser, there is no support other than for naked fragment identifiers that map to HTML's id attribute. We don't even have consistent xml:id support within the browsers, not to mention the conflict between HTML's id attribute and xml:id when serializing as XML syntax. Keep in mind that developers have to implement whatever we cook up, and neither time nor mind share is on XPointer's side.

The net is that we get nothing for free and we have little to rely upon.

Fragile Pointers

There is probably an implicit rule for the Web:

If you want someone to point to your content, you should give it an identifier.

We learned that with links on the Web and gave things unique URIs. We then learned that we need to assign identifiers to portions of content within Web resources for similar reasons. Extra identifiers don't hurt, and they give people the ability to point at specific things in your content. Thus, having a good scheme for a liberal sprinkling of identifiers is a good idea.

Unfortunately, thoughtful content doesn't always happen. Some might say that it rarely happens. As such, if you want to point to specific things and they don't have identifiers, you are out of luck. XPointer was supposed to help solve that, and you didn't get it.

But my original problem is not about linking; it is about tracking origins during the processing of the document. The RDFa API that Green Turtle implements provides the ability to get elements by their type or by specific property values. That makes it possible to write applications that process elements based on their type and various other annotations, going back and forth between the annotation graph of triples and the document to make things happen in that very same document.
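
For example, with Green Turtle loaded in the browser, finding every element that generated a schema.org Event annotation is one call. This is a minimal sketch using the getElementsByType() method from the RDFa API draft; the CSS class is hypothetical:

// every element whose RDFa annotations produced a schema:Event type triple
var origins = document.getElementsByType("http://schema.org/Event");
for (var i = 0; i < origins.length; i++) {
   // each entry is a live element node in the very same document
   origins[i].classList.add("event-highlight");  // hypothetical class for styling
}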

I don't want to generate a URI, nor a pointer, and doing so feels like a workaround. It is the result of a system that isn't designed to track origin or, dare I say, provenance.

Provenance?

In my opinion, the origin of a triple isn't the common use of the term provenance as used in many Semantic Web communities. Often, provenance means the Web resource from whence the triple was generated and not the element node. To complicate this, provenance can also mean the earliest known history and so the term is very overloaded.

A triple in RDFa originates from a particular element. In a few cases (e.g., typeof attributes with more than one type), an element can generate more than one triple. Meanwhile, in reverse, every triple from RDFa annotations has a single element node that is its origin.
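
For example, this hypothetical element carries two types in its typeof attribute and so generates two type triples, both with that one div as their origin:

<div vocab="http://schema.org/" resource="#venue" typeof="Place CivicStructure">...</div>

In Turtle, the harvested triples would be:

<#venue> a <http://schema.org/Place>, <http://schema.org/CivicStructure> .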

Thus, I prefer origin over provenance so that I can avoid the overloaded and confusing use of the word provenance in both industry and research.

Interoperability?

From any Web resource annotated with RDFa, you can generate Turtle or JSON-LD output that represents the triples harvested from the document. Unfortunately, we lose any information about the origin of a triple unless we generate more information. Such additional information would need a URI or pointer to the element from which the triple was generated. That brings us full circle and leaves us holding an empty bag.

Any tool that processes the RDFa directly has this information when it harvests the triples. Within that context, we can use that information, just like Green Turtle does, to provide the application a way to traverse between the annotation graph of triples and the document from whence they came. Unfortunately, this seems to be a different model from what many systems have implemented.

In the end, I am less concerned about interoperability, mainly because it is my own tool chain that I am using to process the information. I'll use whatever tools work, and I don't intend to expose the intermediate states onto the Web. Those might be famous last words, so I'll take some "I told you so" tokens in advance.

Still Searching for a Solution

I don't have a solution for this right now. I'm tempted to use PhantomJS or node.js to run my application as if it were in the browser and then process the output with XProc. This would satisfy my main use case of post-processing the results into static variants for various purposes.
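
A minimal sketch of the PhantomJS idea, assuming the target page already loads Green Turtle and processes its own RDFa; the script just pulls the subjects back out for downstream processing:

// save as harvest.js and run: phantomjs harvest.js http://example.com/doc.html
var page = require('webpage').create();
var url = require('system').args[1];
page.open(url, function (status) {
   var subjects = page.evaluate(function () {
      // document.data is the RDFa API object that Green Turtle provides
      return document.data ? document.data.getSubjects() : [];
   });
   console.log(JSON.stringify(subjects));  // hand this off to an XProc pipeline
   phantom.exit();
});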

I would like to put this content into MarkLogic and run some of the processing there, but they don't support RDFa and they don't have a notion of an origin of a triple. It would be ideal to have this within the context of a database because the origin is a node and storing an internal reference should be straightforward (but I'm guessing). I bet I could hack up eXist and make it do this for me too.

Right now, I have too much to do. The applications work in the browser and I'll let the dust settle for the rest of it. Maybe I'll find a clever solution somewhere in the near future.

GeoJSON to the Rescue (or not)!

This is the fourth entry in my series on my PhD dissertation titled Enabling Scientific Data on the Web. In this entry, we will explore GeoJSON as an alternate approach to geospatial scientific data.

What is GeoJSON?

GeoJSON is a format developed for encoding a variety of geographic data structures. It is feature-oriented, just like KML, and can replace the use of KML in many, but not all, Web applications. The encoding conforms to standard JSON syntax, with an expected structure and set of property names.

A GeoJSON Object Containing Two San Francisco Landmarks
{ "type": "FeatureCollection",
  "features": [
      {"type": "Feature",
       "properties": {
          "name": "AT&T Park",
          "amenity": "Baseball Stadium",
          "description": "This is where the SF Giants play!"
       },
       "geometry": {
          "type": "Point",
          "coordinates": [-122.389283, 37.778788 ]
       }
      },
      {"type": "Feature",
       "properties": {
          "name": "Coit Tower"
       },
       "geometry": {
          "type": "Point",
          "coordinates": [ -122.405896, 37.802266 ]
       }
    }
  ]
}

A GeoJSON object starts with a feature collection, and each feature is a tuple of a geometric object, an optional identifier, and a properties object. The geometry object describes a point, line, or polygon, arrays of such objects, or collections of mixed geometry objects.

The properties property of a feature can be any JSON object value. In the example shown above, it defines a set of metadata for each point that describes a location in San Francisco. If the property names match the expectations of the consuming application, they may affect the rendering (e.g., a map marker might be labeled with the feature name). There is no standardization of what the properties property may contain other than that it must be a legal JSON object value.

GeoJSON at the USGS

The US Geological Survey (USGS) provides many different feeds of earthquakes around the world as GeoJSON. Each feature is a single point (the epicenter), and an extensive set of properties describes the earthquake. The property definitions are documented on the USGS website, but their use is not standardized.

An Earthquake Feed Example
{"type":"FeatureCollection",
 "metadata":{
     "generated":1401748792000,
     "url":"http://earthquake.usgs.gov/earthquakes/feed/v1.0/summary/significant_day.geojson",
     "title":"USGS Significant Earthquakes, Past Day",
     "status":200,
     "api":"1.0.13",
     "count":1
  },
  "features":[
     {"type":"Feature",
      "properties":{
         "mag":4.16,
         "place":"7km NW of Westwood, California",
         "time":1401676603930,
         "updated":1401748647446,
         "tz":-420,
         "url":"http://earthquake.usgs.gov/earthquakes/eventpage/ci15507801",
         "detail":"http://earthquake.usgs.gov/earthquakes/feed/v1.0/detail/ci15507801.geojson",
         "felt":3290,
         "cdi":5.4,
         "mmi":5.36,
         "alert":"green",
         "status":"reviewed",
         "tsunami":null,
         "sig":806,
         "net":"ci",
         "code":"15507801",
         "ids":",ci15507801,",
         "sources":",ci,",
         "types":",cap,dyfi,focal-mechanism,general-link,geoserve,losspager,moment-tensor,nearby-cities,origin,phase-data,scitech-link,shakemap,",
         "nst":100,
         "dmin":0.0317,
         "rms":0.22,
         "gap":43,
         "magType":"mw",
         "type":"earthquake",
         "title":"M 4.2 - 7km NW of Westwood, California"
      },
      "geometry":{
         "type":"Point",
         "coordinates":[-118.4911667,34.0958333,4.36]
      },
      "id":"ci15507801"
    }
  ]
}

It is quite easy to see that when this data is encountered outside of the context of the USGS, the property names have little meaning, and there is no syntax that identifies them as belonging to the USGS.

Out with the Old, in with the New

Just replacing KML's XML syntax and legacy structures from Keyhole with a JSON syntax doesn't accomplish much beyond making it easier for JavaScript developers to access the data. There are plenty of mapping toolkits, written in JavaScript, that can readily do things with GeoJSON data with minimal effort, and that is generally a good thing. Many can consume KML as well, so we haven't necessarily improved access.

The format is still oriented towards map features. If you look at the example above, you'll see that the non-geometry information overwhelms the feature information. If you want to process just the properties, you need to enumerate all the features and then extract the data. Because JSON parses directly into a data structure, GeoJSON makes this a bit easier than KML, and that is an obvious win for the format.
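
For example, flattening a feed like the one above into rows takes only a few lines of JavaScript. This is a sketch that assumes collection is the parsed FeatureCollection and that every geometry is a Point:

// one row per feature: all of the properties plus two geospatial columns
var rows = collection.features.map(function (feature) {
   var row = {};
   Object.keys(feature.properties).forEach(function (name) {
      row[name] = feature.properties[name];
   });
   row.longitude = feature.geometry.coordinates[0];
   row.latitude = feature.geometry.coordinates[1];
   return row;
});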

Remember that we are still looking at scientific data sets, and scientists love to make tables of data. The USGS earthquake feed is a table of data that happens to have two columns of geospatial information (the epicenter) and 26 other columns of data. Yet, we are forced into a map-feature view of this data set by the choice of GeoJSON.

Keep in mind that the OGC says this about KML:

The OGC KML Standard is an XML grammar to encode and transport representations of geographic data for display in an earth browser, such as a 3D virtual globe, 2D web browser application, or 2D mobile application. Put simply: KML encodes what to show in an earth browser, and how to show it. [OGC Reference Model, 2011]

We could say almost the same thing about GeoJSON, except that it doesn't say what to do with the properties. There is only an implied aspect of GeoJSON that the features are rendered into map features and the properties are then displayed somehow. That somehow is left up to the Website developer to code in JavaScript.
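
For example, with the Leaflet toolkit, that code might look like the following sketch; it assumes an existing map object and features that carry a name property:

// render the features, deciding ad hoc which properties to show
L.geoJson(data, {
   onEachFeature: function (feature, layer) {
      layer.bindPopup(feature.properties.name || "(unnamed feature)");
   }
}).addTo(map);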

Does JSON-LD Help?

GeoJSON is fine for what it does and doesn't do, but it probably shouldn't be used to exchange scientific data. It lacks any ability to standardize what to expect as the data for each feature, and such standardization isn't the purview of the good folks who developed it. We might be able to place something in the value of the properties property to facilitate syntactic recognition of specific kinds of data.

One thing that I am considering exploring is a mixed model where the properties object value is assumed to be a JSON-LD object. That would allow much richer annotation of the data and would open the door to standardization. Unfortunately, this is still on my TODO list.
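
For example, a USGS-style feature might carry a JSON-LD context and type inside its properties; the context and vocabulary URIs here are entirely hypothetical:

{ "type": "Feature",
  "properties": {
     "@context": { "@vocab": "http://example.org/earthquake#" },
     "@type": "Earthquake",
     "mag": 4.16,
     "place": "7km NW of Westwood, California"
  },
  "geometry": { "type": "Point", "coordinates": [-118.4911667, 34.0958333] }
}

A GeoJSON consumer should simply treat the @-prefixed members as more properties, while a JSON-LD processor could expand mag and place into full URIs and so recognize the data syntactically.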

What is next?

I'm just about done with formats for scientific data. There are many, many more formats out there and they suffer from many of the same pitfalls. Up next, I want to address what it means to be on the Web, address some architecture principles, and describe some qualities we want for Web-oriented scientific data.

Geospatial Data and KML

This is the third entry in my series on my PhD dissertation titled Enabling Scientific Data on the Web. In this entry, we will explore KML and how it can (or can't) be used to disseminate geospatial scientific data.

What is KML?

In their own words:

The OGC KML Standard is an XML grammar to encode and transport representations of geographic data for display in an earth browser, such as a 3D virtual globe, 2D web browser application, or 2D mobile application. Put simply: KML encodes what to show in an earth browser, and how to show it. [OGC Reference Model, 2011]

Keyhole Markup Language (KML) is the markup language read by the Google Earth browser, technology that Google acquired from Keyhole, Inc. in 2004 (no surprise there). In 2008, Google submitted KML to the Open Geospatial Consortium (OGC) for standardization, and it was approved largely unchanged.

Like many other formats that serve a similar role, KML is feature-oriented. It allows users to describe place markers, polygons, and other spatial features and then attach metadata to those features. It serves an essential role in the ability to build maps out of layers of data.

Using KML for Data

My primary criticism of KML (and other such formats) is that it focuses on the map feature and not on the data that might be associated with it. In KML, data representation via the ExtendedData element feels like an afterthought. Even the description element is just a plain string, with HTML encoded as escaped markup; that is an architectural limitation and a poor design decision.

For example, here is the first Placemark element for snow readings from the NOAA:

<Placemark><description><![CDATA[<b><font size="+2">
41P07S: American Creek</font></b><hr></hr>
<table cellspacing="0" cellpadding="0" width="400">
<tr><td>Elev: 1050 feet</td></tr>
<tr><td>Snow Water Equivalent: 0 inches</td>
<td>Snow Water Equivalent: -99.9 pct norm</td></tr>
<tr><td>Water Year Precipitation: 4.2 inches</td>
<td>Water Year Precipitation: -99.9 pct norm</td></tr>
<tr><td>Snow Depth: 0 inches</td><td>Snow Density: -99.9 percent</td></tr>
<tr><td><a href="http://www.wcc.nrcs.usda.gov/cgibin/wygraph-multi.pl?state=AK&amp;wateryear=current&amp;stationidname=41P07S">Time Series Chart</a></td>
<td><a href="http://www.wcc.nrcs.usda.gov/nwcc/site?sitenum=1189">Site Info</a></td></tr></table>
<a href="http://www.wcc.nrcs.usda.gov/siteimages/1189.jpg"><img width="400" alt="img not available" src="http://www.wcc.nrcs.usda.gov/siteimages/1189.jpg"/></a><hr></hr>
Generated: 30 May 2014]]></description><Snippet></Snippet><name>American Creek</name><LookAt><longitude>
-141.225
</longitude><latitude>
  64.795
 </latitude><range>10000</range><tilt>35.0</tilt><heading>0.0</heading>
</LookAt><visibility>1</visibility>
<styleUrl>#blackdot</styleUrl>
<MultiGeometry><Point><coordinates>
-141.225,64.795,0</coordinates></Point><LineString><coordinates>
-141.23,64.8,0 -141.23,64.79,0 -141.22,64.79,0 -141.22,64.8,0 -141.23,64.8,0 </coordinates></LineString>
</MultiGeometry>
</Placemark>

Looks like markup you can parse, right? Look closer. See that <![CDATA[ inside the description element? Good luck!

Of the many examples that I've surveyed, this is very common. The data is stuffed inside escaped markup in the description so that it looks good in an Earth browser. Little consideration is given to whether you can actually get at the data (elevation, snow measurement, etc.). Only the geospatial feature is easily discovered (i.e., a point at 64.795, -141.225).
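
To actually reach the measurements, you end up parsing twice: once for the KML and once for the escaped HTML hiding in the description. A sketch using the browser's DOM APIs:

// first parse: the KML document itself
var kml = new DOMParser().parseFromString(kmlText, "application/xml");
var description = kml.getElementsByTagName("description")[0];
// the CDATA section is just text as far as XML is concerned,
// so parse it a second time as HTML
var html = new DOMParser().parseFromString(description.textContent, "text/html");
// now scrape the readings back out of presentational table markup
var cells = html.getElementsByTagName("td");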

From the NOAA, you can get this data through other means, but that requires applying for an account to get access to their data in a non-Web format (NetCDF) and then pulling it using their APIs. In essence, unless you really want it, the data isn't readily available, and it doesn't meet my criteria for open data (e.g., open Web formats and protocols).

The NOAA could have provided this via the ExtendedData element, but to do so they would have to duplicate the information they provide in the description. That is, in addition to having the description element show the values via escaped HTML, they would need to put them in Data elements as shown:

         <ExtendedData>
            <Data name="elevation">
               <value>1050</value>
            </Data>
            <Data name="snow">
               <value>0</value>
            </Data>
            ...
         </ExtendedData>
      

While some Earth browsers support rendering this information for the place marker, it is unclear whether all would do so. Thus, information would likely need to be duplicated between the human-readable description and the tool-processable ExtendedData element. That is certainly not an optimal design.

Of course, given my preference for RDFa, I'd rather we:

  1. Not have escaped markup in descriptions.
  2. Use RDFa annotations to avoid duplications or verbose markup.

Such an approach might be:

<Placemark>
<description>
<div vocab="http://noaa.gov/" typeof="Observation">
<h1 property="title">41P07S: American Creek</h1>
<table>
<tr><td>Elev: <span property="elevation">1050</span> feet</td></tr>
<tr><td>Snow Water Equivalent: <span property="snow">0</span> inches</td>
...
</table>
</div>
</description>
...
</Placemark>

Of course, we could make the annotations more comprehensive (e.g. include units of measure) and then they would be more verbose too. At least the data would be in one place and we wouldn't need to shoehorn the data into minimal markup that may not capture all its nuances.

Yet, this approach is unusable because KML processors would throw errors when the description contains markup. You aren't allowed to do that, by definition, so stop wishing for sanity!

Features or Tables?

My other major criticism of using KML to distribute data is that it is designed to render features in a map viewer. If your main task is to process the data and run some algorithm over it, this is not an optimal format. You have to process each geometric object, understand whether it contains data, and then extract the bits you need, when all you really wanted was a table of data.
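
Even recovering just the points means walking the feature structure. A sketch, reusing the parsed kml document from the earlier example and assuming the first coordinates element under each Placemark is the Point:

var placemarks = kml.getElementsByTagName("Placemark");
var rows = Array.prototype.map.call(placemarks, function (pm) {
   var coords = pm.getElementsByTagName("coordinates")[0]
                  .textContent.trim().split(",");
   // the measurements themselves are still locked inside the description
   return { longitude: parseFloat(coords[0]), latitude: parseFloat(coords[1]) };
});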

Many data feeds, like those from the NOAA, are point or simple-polygon oriented and contain a set of measurements of the same types, repeated over and over again. In the example used here, each set of measurements is taken at a single geospatial coordinate. The measured quantities are all of the same kinds, and the same is true of the metadata (e.g., the elevation is always the same for each point).

Frankly speaking, a table of data is easier to process. Please, give me a table of data (I did say please).

What is next?

I could bore you all by describing the issues with GML (a whole bunch of XML Schema, and then you are still not done), but I won't. I want to address GeoJSON and a few other odd formats next before we start talking about solutions.
