Welcome to the web home of Alex Milowski. Here you'll find information about me, some of the software I've written, and the projects in which I participate. You'll find a variety of mathematics, technology, papers, and presentations on this website, written or contributed to by me. If you have any questions or comments, please don't hesitate to contact me.
Problems with Microdata [view comments][permalink]
Given my recent grumbling about Google Blink developers trying to remove XSLT from Chrome, the current state of Microdata vs RDFa makes me think we're creating another level of incompatibilities on the Web. XSLT generally failed client-side on the Web due to poor implementations within the various browsers. At this point, of course, people aren't using it en masse because it just didn't work.
Microdata and its unknown status make me feel like we're headed for more broken, partial ideas on the Web. We'd get proper adoption of some kind of Semantic Web annotation if there were one syntax that everyone was using. Yet the schema.org folks (Google et al.) keep pushing Microdata. I'd personally prefer RDFa.
It is remarkable that while the W3C members seem to have chosen to have only one specification (RDFa!), schema.org keeps chugging along with Microdata. The status of whatever Microdata really is within the W3C, or on the Web, is really unknown--even to people like myself who try to keep close tabs on such things.
What is Microdata?
Attempting to answer the question "What exactly is Microdata?" only leads to more questions. First, which specification are we going to use: the 24 May 2011 draft linked to from the schema.org page, the latest draft dated 25 Oct 2012, or the WHATWG draft that can change at any moment?
The real problem is that this specification is in limbo, having been derailed from the recommendation track. Essentially, there was push back, some very well thought out arguments against Microdata (e.g. by Manu Sporny), and a task force for some kind of unification or co-existence (read Jeni Tennison's blog post on the subject).
There's a lot of passion packed into this debate and, yet, after all this time, nothing has really changed that much. It is unclear what the status of Microdata is at the W3C and schema.org just plugs on ahead assuming that we all know and want this thing called Microdata.
My Problems with Microdata
While I have huge issues with the process that has led us into this mess, as the implementer of Green Turtle, I have many technical issues when it comes to implementing Microdata (assuming I can pick a spec against which to implement):
- Lack of a default vocabulary mechanism. Inferring property URIs from an overt type (e.g. the `itemtype` attribute) feels like a pure hack. It works in closed type systems like schema.org, but it doesn't work in general. RDFa has the `vocab` attribute that allows an author to declare a default. That feels like a missing feature in Microdata.
- Items without types? What's the point of having an untyped item? It isn't useful to schema.org nor to any other comparable use. All a processor knows is that there is an item with some properties whose unique names can't be determined. Of course, this is the red herring of Microdata: oh look, all you really need is an `itemscope` attribute (unless you want to do something useful, and then you need an `itemtype` attribute too).
- No Shorthand Mechanism. Prefixes and default vocabularies are "ugly" until you have real data with very verbose type and property URIs mixed from different real-world vocabularies. At that point, prefixes to shorten URIs (e.g. CURIEs) seem like a really nice feature.
- No Tests!!! A specification without tests guarantees interoperability problems. There are no well-defined, publicly available, and agreed upon tests for Microdata. That's a real problem.
- No defined Semantic Web mapping. If Microdata is to exist on the Web, it really needs to play well with others. Specifically, as it is adding semantic annotations to Web pages, it needs to have a well-defined mapping into triples and the Semantic Web. That's a real problem for Microdata because you can annotate items without ever giving a context for identifying the vocabulary. The result is that some Microdata maps to triples well and some does not without a great deal of assumptions.
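To make the default-vocabulary and shorthand complaints concrete, here is a small sketch of how a processor might mint absolute property URIs. The function names and the type-based inference rule are mine, purely for illustration (the actual rules vary between Microdata drafts, which is part of the problem); the point is that Microdata must derive property URIs from the item's type, so an untyped item yields nothing usable, while RDFa's `vocab` and prefix mechanisms give the author direct control:

```javascript
// Illustrative sketch only; these names and rules are not from any spec's API.

// Microdata-style: the property URI can only be inferred from the item's type.
// The concatenation rule here is one possible scheme; drafts differ on the details.
function microdataPropertyURI(itemType, name) {
  if (!itemType) return null;      // untyped item: no absolute URI can be minted
  return itemType + "/" + name;
}

// RDFa-style: a prefix map (CURIEs) or a default vocabulary (vocab) resolves names.
function rdfaPropertyURI(name, vocab, prefixes) {
  const colon = name.indexOf(":");
  if (colon >= 0) {                // CURIE, e.g. "dc:title"
    const prefix = name.substring(0, colon);
    if (prefixes && prefixes[prefix]) {
      return prefixes[prefix] + name.substring(colon + 1);
    }
  }
  if (vocab) return vocab + name;  // plain term resolved against the default vocabulary
  return null;
}

console.log(microdataPropertyURI("http://schema.org/Person", "name"));
// → http://schema.org/Person/name
console.log(microdataPropertyURI(null, "name"));
// → null: the untyped-item problem
console.log(rdfaPropertyURI("dc:title", null, { dc: "http://purl.org/dc/terms/" }));
// → http://purl.org/dc/terms/title
```

The untyped case returning `null` is exactly the hole described above: without a type or a declared vocabulary, there is no principled way to turn a bare property name into a triple.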
I could go on but I'll stop there.
Clean Up Your Mess!
Bing, Google, Yahoo, and Yandex ran on ahead with Microdata on schema.org. Because Google is there (and possibly the others too), everyone just assumes Microdata has some legitimacy. Maybe it does or maybe it is a disaster.
I have one thing to say: clean up your mess! If my 8-year-old son spills his milk, he gets a towel (sometimes at my suggestion) and cleans up his mess. Google et al. have spilled the trash can of partially-abandoned specifications onto the Web, and they need to get a towel and clean it up.
Green Turtle - Injection and Microdata Options [view comments][permalink]
The way that the script on the Web page and the extension talk to each other has been greatly improved. By doing so, I got rid of a hack where it used a meta element to pass triples to the extension. The new method has the additional benefit of supporting other extensions--possibly in other browsers--with the same technique.
Microdata packaging and options.
A packaging of Microdata has been added to both the script and the extension. The experimental Microdata processor is now included with Green Turtle and can be enabled via a simple switch:
Including the above script just after the Green Turtle script will turn on Microdata support (off by default). You can also disable Turtle processing much the same way (on by default):
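The original script snippets did not survive extraction, so here is a hedged reconstruction of the idea only. The property names below (`GreenTurtle.processors`, `enabled`) are my guesses, not Green Turtle's documented API; consult the Green Turtle source for the actual switches. A stand-in object is defined so the sketch runs on its own:

```javascript
// Stand-in for the real GreenTurtle global so this sketch is self-contained;
// the property names here are hypothetical, not Green Turtle's documented API.
var GreenTurtle = (typeof window !== "undefined" && window.GreenTurtle)
  ? window.GreenTurtle
  : { processors: { microdata: { enabled: false }, turtle: { enabled: true } } };

// Placed just after the Green Turtle script: turn on Microdata support
// (off by default) ...
GreenTurtle.processors.microdata.enabled = true;

// ... and, in much the same way, turn off Turtle processing (on by default).
GreenTurtle.processors.turtle.enabled = false;
```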
Any Microdata is turned into triples and added to the graph along with any RDFa or Turtle found in the document. If the Microdata can't be turned into triples, it is just ignored. It would be really nice if there were a specification for this, but there really isn't as of yet.
Microdata is also packaged with the extension and is off by default. You can enable this feature in the extension options page.
The other big new feature is injection of the processor into documents that don't already have Green Turtle. When the extension detects a Web page that does not have Green Turtle, it will inject the script from a version stored in the extension. The result is that document.data and all the APIs are available from the console. This enables you to do additional experimentation in the console and lets other extensions use the graph data.
This also provides a more efficient means for handling large documents. This feature is on by default but you can turn it off in the extension options page. If you turn it off, it will still try to harvest triples but they won't be available from the console.
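Once the script is present, the graph can be queried from the console through `document.data`. The sketch below shows the query shape against a tiny stand-in graph so it can run outside a browser; the store itself is mine, for illustration, while the `getSubjects`/`getValues` call pattern follows the RDFa API style that Green Turtle implements:

```javascript
// A miniature stand-in for the graph behind document.data; only the call
// shape matters here. The triples are made-up example data.
const data = {
  triples: [
    { s: "http://example.org/#me", p: "http://schema.org/name", o: "Alex" },
    { s: "http://example.org/#me", p: "http://schema.org/url",  o: "http://example.org/" }
  ],
  // All distinct subjects in the graph.
  getSubjects() {
    return [...new Set(this.triples.map(t => t.s))];
  },
  // All values of a given property for a given subject.
  getValues(subject, property) {
    return this.triples
      .filter(t => t.s === subject && t.p === property)
      .map(t => t.o);
  }
};

// In a browser console, the same pattern would start from document.data.
console.log(data.getSubjects());
// → [ "http://example.org/#me" ]
console.log(data.getValues("http://example.org/#me", "http://schema.org/name"));
// → [ "Alex" ]
```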
The extension now has an options page. Just go to the Extensions menu item in Chrome and click on the Options link next to Green Turtle. Checking Enable Microdata will automatically enable Microdata for all injected scripts. There is also a control for turning off script injection.
Disk Soup: AWS, EBS, RAID, MarkLogic, and Pinch of Salt! [view comments][permalink]
At the 2013 MarkLogic User Conference, I learned all kinds of interesting and valuable information about running MarkLogic on AWS (Amazon Web Services) EC2 servers. Most notably, it was mentioned that I wasn't necessarily going to get a huge performance gain over regular EBS storage via the RAID 10 configuration that I cooked up. That was good news to me because it costs me quite a bit to have all that extra EBS storage for RAID 10.
I just finally got around to testing all of this out with live data. I trimmed down my data, merged all my forests, and cleaned up the disk to ensure I knew exactly how much storage I needed. I finally got it all down to about 148GB of on-disk data for about 3+ months of weather data.
My current configuration is eight 200GB volumes arranged in a RAID 10 configuration. That is 1.6TB of storage that yields about 750GB of usable disk space.
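For reference, the capacity arithmetic for that layout is sketched below. RAID 10 stripes across mirrored pairs, so raw usable capacity is half of what is provisioned; the roughly 750GB figure above is what remains after filesystem overhead on the 800GB raw capacity:

```javascript
// RAID 10 capacity: data is striped across mirrored pairs, so usable raw
// capacity is half the total provisioned (and billed!) storage.
const volumes = 8;
const volumeSizeGB = 200;

const totalGB = volumes * volumeSizeGB;  // 1600 GB of EBS provisioned
const rawMirroredGB = totalGB / 2;       // 800 GB raw after mirroring

console.log(totalGB);        // → 1600
console.log(rawMirroredGB);  // → 800
```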
To consolidate this onto one volume, I created a 600GB EBS volume, created an ext4 filesystem, and copied all the data across while everything was shut down. And then I waited, and waited (copying is sure slow), and waited...
When I was finally ready, I started up MarkLogic and all my Web applications to test the throughput. The result: it was twice as slow! I get at least a 2× increase in performance by having RAID 10 via mdadm.
Fortunately, the data hadn't changed, so I could easily switch back to the old filesystem. I restarted MarkLogic and verified my measurements: yes, RAID 10 via mdadm is at least twice as fast.
I then looked into Provisioned IOPS and whether I could test that. Unfortunately, it isn't available for the instance type I'm using (m2.xlarge) and I would have to move to the next level up (m2.2xlarge). The additional cost of Provisioned IOPS for EBS and the m2.2xlarge removes any cost savings I might have had.
Here's the takeaway:
- RAID10 via mdadm is a good middle ground for AWS. It will give you better performance, possibly twice as fast as regular EBS storage.
- RAID10 will cost you less for overall EBS storage than Provisioned IOPS.
- Provisioned IOPS will give you better performance guarantees, and you may find you want/need to pay for that. I don't have a measurement of that as of yet.
I wish I had an easy way to test out Provisioned IOPS for EBS storage with my system. It would be great to compare everything all at once. Unfortunately, I would first have to upgrade to a different instance type and then re-run all the tests I've done so far. For my current work, that isn't necessary.
Yet, when I want more performance, I now know what to do next.
[More entries ...]