Sunday, October 2, 2011

Layers of referentiality

Most computer systems interact with other systems in some way, consuming or providing data, or maybe both. Referential resources are at the core of the Semantic Web.

There are many ways to identify a piece of data. In a database table, a row represents an entity, and that entity is identified by a value, most likely a number or a short string. These kinds of IDs are extremely local to the system owning the data. They carry no meaning beyond being some kind of ID for something, and if exported from the system they become nothing but noise.

IDs exist on different levels, from the simplest IDs of a database entity to globally unique, resolvable resource identifiers such as a URL. I present the layers of referentiality. Each layer up the stack provides a little more abstraction and a little more context to the identifier.

Each layer is listed below with its degree of referentiality, the kind of identifier it uses, and its usage:

Semantic Web: globally unique, resolvable ID (URL). Identifies web resources and provides information about how to access a resource.
Integration: globally unique ID (URN/URI). Identifies system resources; contains a system identifier; might be a URL; should never change.
Domain: system-local ID (URN). Identifies domain concepts; references object type and id.
Persistence: entity ID (alphanumerical). Identifies table rows and document file names.
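
As a hypothetical example (all names, numbers and domains below are made up), the same purchase order could be identified like this on each layer:

    // Hypothetical identifiers for the same purchase order, one per layer.
    String persistenceId  = "42";                                            // auto-incremented primary key
    String domainId       = "urn:example:order:2011-0042";                   // the order number users know
    String integrationId  = "urn:example:shop.example.com:order:2011-0042";  // qualified with a system identifier
    String semanticWebId  = "http://shop.example.com/orders/2011-0042";      // resolvable with HTTP GET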

An ID on the domain layer doesn't have to be a concatenation of the object type name and the primary key of the entity, and probably shouldn't be. A good ID on the domain layer is something that the domain experts (the users of your system) can relate to. And that is most likely not some automatically incremented ID in a database table.

Pay special attention to globally unique IDs of the integration layer: they should never change. URLs to websites and web services are known to change often. How many times have you tried to access a URL only to find it was a broken link? You should of course take precautions to avoid breaking the web, but sometimes it's inevitable, for some good or bad reason.

A resource representation, for example an HTML document, might be retrieved in different ways. Probably the most obvious way to retrieve an HTML document is to make an HTTP GET request to a web server, resolved from a URL. Another way is to retrieve it from a database or a file system, in which case the URL from the first example becomes pretty useless.
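
A minimal sketch of the first case, resolving an HTML representation from a URL with plain Java (the URL is of course made up):

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    public class FetchHtml {
        public static void main(String[] args) throws Exception {
            URL url = new URL("http://shop.example.com/orders/2011-0042");
            HttpURLConnection connection = (HttpURLConnection) url.openConnection();
            connection.setRequestMethod("GET");
            connection.setRequestProperty("Accept", "text/html");

            // Read and print the HTML representation.
            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(connection.getInputStream(), "UTF-8"));
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
            reader.close();
        }
    }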

You should focus on defining a strong namespace for your resources, to provide resource identifiers that are globally unique and very likely never to change. From a strong globally unique ID, in time, the Semantic Web will reveal itself.

Wednesday, September 14, 2011

Producing an HTML representation of an Atom feed with Java

Atom documents are XML documents structured according to the Atom Syndication Format specification. In Java, JAXB seems to be the recommended way to marshal Java object structures to XML streams. Together with a JAX-RS implementation for building RESTful web services, it can marshal and unmarshal between Atom XML documents and Java objects.
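
A minimal sketch of what such a resource could look like, assuming RESTeasy's bundled Atom object model (org.jboss.resteasy.plugins.providers.atom); the resource path and feed contents are made up:

    import java.net.URI;
    import java.util.Date;
    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import org.jboss.resteasy.plugins.providers.atom.Entry;
    import org.jboss.resteasy.plugins.providers.atom.Feed;

    @Path("/orders")
    public class OrderFeedResource {

        @GET
        @Produces("application/atom+xml")
        public Feed getOrders() throws Exception {
            Feed feed = new Feed();
            feed.setId(new URI("urn:example:orders"));
            feed.setTitle("Orders");
            feed.setUpdated(new Date());

            Entry entry = new Entry();
            entry.setId(new URI("urn:example:order:2011-0042"));
            entry.setTitle("Order 2011-0042");
            entry.setUpdated(new Date());
            feed.getEntries().add(entry);

            return feed;
        }
    }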

Lack of support in browsers

The content type of an Atom document is application/atom+xml, but most browsers seem to render Atom documents rather poorly. Some browsers pretend to be news aggregators, but on my quest for navigating my API, all browsers fall short. I've also got some resource representations of vendor-specific types, take application/vnd.example+xml as an example. When trying to view content of this type, most browsers open a "download file" dialog instead of displaying the XML structure. I mean, come on, it's just XML!

Chrome has probably come closest to my needs with the Advanced REST client Application plugin. I want to follow the links of my Atom documents by clicking on them, for easy debugging and exploration of the API. This is where the Accept HTTP request header comes in handy.

Browsers seem to favor text/html above all other content types, for understandable, human reasons. A simple XSL transformation from Atom documents to HTML might just do for now.

RESTeasy on JBoss

I use RESTeasy on JBoss 7 for my RESTful web services to provide me with Atom XML document representations. It is possible to convert XML to HTML with an XSL stylesheet, either server side using XSLT in Java or on the client side by applying an xml-stylesheet processing instruction to the XML document.
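
A minimal sketch of the server-side variant with the standard JAXP transformation API; the stylesheet and file names are made up:

    import java.io.File;
    import javax.xml.transform.Transformer;
    import javax.xml.transform.TransformerFactory;
    import javax.xml.transform.stream.StreamResult;
    import javax.xml.transform.stream.StreamSource;

    public class AtomToHtml {
        public static void main(String[] args) throws Exception {
            // Compile the (hypothetical) Atom-to-HTML stylesheet.
            Transformer transformer = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new File("atom-to-html.xsl")));
            // Transform an Atom document into an HTML document.
            transformer.transform(new StreamSource(new File("feed.atom")),
                    new StreamResult(new File("feed.html")));
        }
    }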

My attempt to add the xml-stylesheet instruction failed, since the Atom JAXB provider doesn't support marshaller decorators using the @XmlHeader and @Stylesheet annotations, like the regular JAXB provider in RESTeasy does.
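
For reference, the decorator approach would look something like this (the stylesheet path is made up, and as noted, the Atom provider ignores the annotation):

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import org.jboss.resteasy.annotations.providers.jaxb.Stylesheet;
    import org.jboss.resteasy.plugins.providers.atom.Feed;

    @Path("/orders")
    public class OrderFeedResource {

        @GET
        @Produces("application/atom+xml")
        @Stylesheet(type = "text/xsl", href = "/styles/atom-to-html.xsl")
        public Feed getOrders() {
            // Build and return the feed as before...
            return new Feed();
        }
    }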

JAX-RS allows a single resource to provide different representations based on request headers. Writing a MessageBodyWriter implementation to transform Atom XML from the Atom JAXB provider to text/html seemed like a good idea, but my attempt failed: my writer wasn't used at all. It seems that the JAXB provider grabs full control of output marshalling, but since JAXB does not support writing HTML, it just fails miserably with an exception about not finding a ContextResolver for text/html.
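
A sketch of what such a writer could look like; the HTML rendering is just an illustration, and as described, RESTeasy never invoked it for me:

    import java.io.IOException;
    import java.io.OutputStream;
    import java.lang.annotation.Annotation;
    import java.lang.reflect.Type;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.MediaType;
    import javax.ws.rs.core.MultivaluedMap;
    import javax.ws.rs.ext.MessageBodyWriter;
    import javax.ws.rs.ext.Provider;
    import org.jboss.resteasy.plugins.providers.atom.Entry;
    import org.jboss.resteasy.plugins.providers.atom.Feed;

    @Provider
    @Produces("text/html")
    public class FeedHtmlWriter implements MessageBodyWriter<Feed> {

        public boolean isWriteable(Class<?> type, Type genericType,
                Annotation[] annotations, MediaType mediaType) {
            return Feed.class.isAssignableFrom(type);
        }

        public long getSize(Feed feed, Class<?> type, Type genericType,
                Annotation[] annotations, MediaType mediaType) {
            return -1; // length unknown
        }

        public void writeTo(Feed feed, Class<?> type, Type genericType,
                Annotation[] annotations, MediaType mediaType,
                MultivaluedMap<String, Object> httpHeaders, OutputStream out)
                throws IOException {
            // Render the feed entries as a simple HTML list.
            StringBuilder html = new StringBuilder("<html><body><ul>");
            for (Entry entry : feed.getEntries()) {
                html.append("<li>").append(entry.getTitle()).append("</li>");
            }
            html.append("</ul></body></html>");
            out.write(html.toString().getBytes("UTF-8"));
        }
    }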

Jackson did it, why can't I?

It might be possible to write a ContextResolver provider for the Atom Feed and Entry classes that can handle text/html, and maybe the JAXB provider can utilize that to output HTML. My confidence comes from the fact that the Jackson JSON Processor does something similar. Looking forward to tomorrow.

Tuesday, September 13, 2011

Navigating resource representations with Atom documents

I've been browsing the web for some time looking for good documentation on Linked Data and the Semantic Web. I've done some research around the Atom and RDF formats, but I often lack visibility from the resource itself to its metadata representation.

I need some way of navigating my data. RDF, Atom and triples are good stuff, but they only describe a single resource and the links to other resources it relates to. For this case RDF seems like a bad solution since it doesn't reference other RDF documents. And basically, an RDF document doesn't need to be hosted on a web server at all.

Atom, with its link element, seems like a much better solution, since I can link to anything I like. Both Atom and RDF support linking to things, even things not present on the web, like a book. Atom, on the other hand, can also reference other Atom documents, or even RDF documents.

Cool URIs are cool, but still not the silver bullet...

Linking with Atom documents looked like the nicest solution for me... until I came across the Cool URIs for the Semantic Web article over at w3.org. It's about accessing resources and the metadata about them, using RDF to describe a company or a person, with HTML for human-readable presentation.

I found the article fascinating, but I didn't care much for the "you must use HTTP for identifying your resources" part. According to the article, the way to access a resource is to request its URI, which must be an HTTP(S) URL, with an Accept header of "application/rdf+xml" or the like to request the wanted representation. If you want the HTML representation, you ask the resource for "text/html". It walks through some scenarios using URI fragments, Content-Location headers and 303 redirects, with possibilities of making bookmarkable URLs for each representation. This is called content negotiation.
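
A minimal sketch of the 303 variant with JAX-RS, where the resource URI redirects to a representation-specific URL based on the Accept header (paths and names are made up, and the Accept handling is deliberately crude):

    import java.net.URI;
    import javax.ws.rs.GET;
    import javax.ws.rs.HeaderParam;
    import javax.ws.rs.Path;
    import javax.ws.rs.core.Response;

    @Path("/id/alice")
    public class PersonResource {

        @GET
        public Response get(@HeaderParam("Accept") String accept) {
            // The non-information resource itself cannot be returned,
            // so answer with a 303 See Other pointing to a suitable document.
            if (accept != null && accept.contains("application/rdf+xml")) {
                return Response.seeOther(URI.create("/data/alice.rdf")).build();
            }
            return Response.seeOther(URI.create("/page/alice.html")).build();
        }
    }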

This content negotiation looks like a pretty good solution, but it gets harder if you want to store the metadata about the resource on a different system than the described resource, for example on a different domain. The article presents a solution for this too, in the case of HTML, where it's possible to reference back to the RDF metadata using the HTML link element with a relation type of "alternate" and the MIME type set to "application/rdf+xml".

But aren't those circular references?

OK, so with HTML we can reference back to the metadata that references the HTML representation. Those are circular relations, and I consider that bad practice. You might consider the resource URI a third reference here. If you define that as the master of all data for that resource, you can get away with it, since it masters both the RDF and HTML representations and knows about them both.

My biggest problem with the back-referencing solution presented in the Cool URIs article is that I don't use HTML for the presentation of my resources, but some other type of documents that are unfit to provide the linking feature that both Atom and HTML have.

A plausible solution anyway?

So basically, to solve my problem using one of the possibilities presented in the Cool URIs article, my web server has to handle both the document representation and the metadata representation. In my case, Atom is the preferred format, with a sane abstraction level.

For the moment I can live with the same system serving both representations.

The HTTP specification defines a Link response header that provides much the same features as HTML and Atom links do. On a request for a resource, regardless of the requested representation, I can supply the response with Link headers pointing to all available representations of that resource.
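
With JAX-RS this is just a matter of adding headers to the response; a sketch with made-up URLs and a hypothetical helper for building the Atom XML:

    import javax.ws.rs.GET;
    import javax.ws.rs.Path;
    import javax.ws.rs.Produces;
    import javax.ws.rs.core.Response;

    @Path("/orders/2011-0042")
    public class OrderResource {

        @GET
        @Produces("application/atom+xml")
        public Response get() {
            String atom = buildAtomEntry(); // hypothetical helper
            return Response.ok(atom)
                    .header("Link", "</orders/2011-0042.rdf>; rel=\"alternate\"; type=\"application/rdf+xml\"")
                    .header("Link", "</orders/2011-0042.html>; rel=\"alternate\"; type=\"text/html\"")
                    .build();
        }

        private String buildAtomEntry() {
            return "<entry xmlns=\"http://www.w3.org/2005/Atom\">...</entry>";
        }
    }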

Later it might also be possible to maintain a registry of external systems providing representations of the same resource. If we also find a way of retrieving representations of things identified with non-HTTP URIs without content negotiation, now that would be a silver bullet, eeyy?

Monday, September 12, 2011

Making a scalable Apache access log analyzer in Java

I'm thinking about writing a system for handling log files, especially Apache Tomcat access logs with the processing time field enabled. It will mainly be used to analyze request processing times and requests per second.

It doesn't have to be real time, as its primary goal is to present changes in performance over time.

The system might get hold of the access logs over HTTP, where an agent on each server watches a directory for newly rotated log files and submits those to the analyzer using HTTP PUT or POST.

By making a REST web service with path parameters for system, server and log file per submission, it allows segmentation per server and system. It also provides a nice, known ID for each log file entry, allowing the agent to resubmit a log file in case of failures or updates.
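
A sketch of such a submission endpoint; all paths and names are made up, and the storage call is just a placeholder:

    import java.io.InputStream;
    import javax.ws.rs.Consumes;
    import javax.ws.rs.PUT;
    import javax.ws.rs.Path;
    import javax.ws.rs.PathParam;
    import javax.ws.rs.core.Response;

    @Path("/systems/{system}/servers/{server}/logs/{logfile}")
    public class LogSubmissionResource {

        @PUT
        @Consumes("text/plain")
        public Response submit(@PathParam("system") String system,
                               @PathParam("server") String server,
                               @PathParam("logfile") String logfile,
                               InputStream body) {
            // Hypothetical storage call; a resubmit with the same path simply overwrites.
            store(system, server, logfile, body);
            return Response.noContent().build();
        }

        private void store(String system, String server, String logfile, InputStream body) {
            // Persist the raw log file, e.g. to a document store.
        }
    }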

For all this, I need some functionality:

For the log entry parser, I can utilize StringTokenizer to read each field of the log entries.
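
A minimal sketch, assuming an access log pattern with the processing time appended as the last field (the pattern itself is an assumption):

    import java.util.StringTokenizer;

    public class AccessLogLineParser {

        // Assumes a pattern like "%h %l %u %t \"%r\" %s %b %D",
        // i.e. the processing time in milliseconds is the last field.
        public long parseProcessingTimeMillis(String logLine) {
            StringTokenizer tokenizer = new StringTokenizer(logLine, " ");
            String last = null;
            while (tokenizer.hasMoreTokens()) {
                last = tokenizer.nextToken();
            }
            return Long.parseLong(last);
        }
    }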

I also need some statistics reporters, multiple per log file. For graphing, I have some Perl scripts from an earlier project that can come in handy. I might store raw output data in RRD files, with for example rrd4j, for later reporting. This is needed since I can't know all the reports needed up front, and will most likely need new ones after log files start to flow in.

Processing log files and outputting different analysis reports is the perfect case for MapReduce, and Hadoop might be a nice choice as a distributed architecture. The Cascading framework that sits on top of Hadoop looks very cool and works more like the UNIX way of piping and filtering.
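
A sketch of what the map side could look like with plain Hadoop; the field layout assumption is the same as in the parser above, and the key name is made up:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits the processing time (last field of each access log line) under a single key,
    // so a reducer can compute averages or percentiles per log file.
    public class ProcessingTimeMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {

        private static final Text KEY = new Text("processing-time-millis");

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split(" ");
            String last = fields[fields.length - 1];
            try {
                context.write(KEY, new LongWritable(Long.parseLong(last)));
            } catch (NumberFormatException ignored) {
                // Skip malformed lines.
            }
        }
    }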

To control the analysis jobs I need a scheduler, and Quartz might do the job. New reports should be generated periodically, but only if new log files have been submitted. I have to find some way to persist the job details, to allow downtime without losing data.
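
A sketch with the Quartz 2 fluent API; the job, names and interval are made up, and job persistence would still need a JDBC job store configured:

    import static org.quartz.JobBuilder.newJob;
    import static org.quartz.SimpleScheduleBuilder.simpleSchedule;
    import static org.quartz.TriggerBuilder.newTrigger;
    import org.quartz.Job;
    import org.quartz.JobDetail;
    import org.quartz.JobExecutionContext;
    import org.quartz.Scheduler;
    import org.quartz.Trigger;
    import org.quartz.impl.StdSchedulerFactory;

    public class ReportScheduling {

        public static class GenerateReportsJob implements Job {
            public void execute(JobExecutionContext context) {
                // Check for newly submitted log files and regenerate reports.
            }
        }

        public static void main(String[] args) throws Exception {
            Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
            scheduler.start();

            JobDetail job = newJob(GenerateReportsJob.class)
                    .withIdentity("generate-reports")
                    .build();

            Trigger trigger = newTrigger()
                    .withIdentity("hourly-reports")
                    .withSchedule(simpleSchedule().withIntervalInHours(1).repeatForever())
                    .build();

            scheduler.scheduleJob(job, trigger);
        }
    }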

I am a little unsure about the storage backend. The log files can be stored compressed, and uncompressed on the fly when streamed to consumers. For storing raw log files, a simple NoSQL storage might do the job very well, such as CouchDB, a document store with streaming attachment support. Parsing, filtering, grouping and mapping data can be done in Java code in separate jobs. If the intermediate values from those jobs are persisted, then I don't need advanced queries from the storage backend.

OK, so these were some thoughts on how a log analyzer system might work.

The alternative: Google Analytics

I have also considered using Google Analytics to process the logs for me, but it might not be the best choice, since the access logs will contain machine-to-machine communication of XML documents and the like. Even if I chose to do this, I would still need to detect rotated log files, parse them and submit them to GA. Also, there are not many good examples of server-side tracking with Google Analytics. I found one with example code, but it focuses on browser-based traffic to track users. I can't seem to find a way to track server processing time in a stable manner with Google Analytics, so it's pretty much a non-option for me.

After writing all these words, it seems to me that it might be a little more complicated than initially hoped for. I might have to take some shortcuts. For example, the sending of log files could be scheduled using a cron job instead of a daemon agent. I have not worked much with RRD files yet, so I don't know if they can hold all the data I need.

I guess the system might be scalable as well, backed by a horizontally scalable database, asynchronous analyzer jobs and standalone log agents on each server.

First blog entry

The first blog entry. Testing out Google's Blogger tool.

For other posts, visit http://hovenko.no/blog/english-posts/