\documentclass{scrartcl}
\setlength{\textheight}{24cm}
\usepackage[latin1]{inputenc}
\usepackage{color}
\usepackage{graphics}
\usepackage{psfrag}
\usepackage{graphicx}
\definecolor{grey}{rgb}{.9,.9,.9}
\definecolor{red}{rgb}{1,0,0}
\definecolor{olive}{rgb}{.2,.5,.3}
\definecolor{CadetBlue}{cmyk}{0.62,0.57,0.23,0}
\definecolor{OliveGreen}{cmyk}{0.64,0,0.95,0.40}
\usepackage{listings}
\lstset{numbers=left, numberstyle=\tiny, numbersep=5pt,basicstyle=\tiny,showstringspaces=false, showtabs=false,tab= ,framexleftmargin=5mm, frame=single, captionpos=b,backgroundcolor=\color{grey}, commentstyle=\color{CadetBlue},boxpos=c,stringstyle=\ttfamily\color{red},keywordstyle=\ttfamily\color{OliveGreen}}
\lstset{language=HTML}
\title{GRDDLing in XCerpt\\ - Not only Transformations but Answers - \\Development of Use Cases}
\date{}
\author{}
\begin{document}
\maketitle
\tableofcontents
\newpage
\section{General Information}
\subsection{Description}
The idea is to develop a use case using GRDDL, a mechanism for gleaning resource descriptions from dialects of languages. To this end, several XHTML files are created that contain hCalendar microformat annotations. Software can easily read these annotations to extract semantic information from the document. Such information could be:
\begin{itemize}
\item Title of event
\item Description of event
\item Start of event
\item End of event
\item Duration of event
\item Location of event
\item Interval of event
\item Frequency of event
\item Contact information
\end{itemize}
These hCalendar annotations can be placed almost anywhere in the document. All that is required is a tag whose class attribute contains the specific keyword.
\lstset{language=HTML}
\begin{lstlisting}[caption=hCalendar microformat]
Description for Event that begins on 18/10/2007 at 09:00 until 10:00
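<!-- Illustrative sketch only (an assumption, not the original markup):
     the sentence above annotated with hCalendar class names.
     The ISO 8601 title values are derived from the date and times in the text. -->
<div class="vevent">
  <span class="description">Description for Event</span>
  that begins on
  <abbr class="dtstart" title="2007-10-18T09:00">18/10/2007 at 09:00</abbr>
  until
  <abbr class="dtend" title="2007-10-18T10:00">10:00</abbr>
</div>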
\end{lstlisting}
These annotations can appear in any tag at arbitrary depth. So when transforming these documents, we have to keep track of which annotation belongs to which event. Events can also be nested inside other events. In general, such annotations can appear in:
\begin{itemize}
\item simple tags like $<$span$>$, $<$div$>$, $<$p$>$, $<$h$>$,...
\item tags for tables $<$table$>$, $<$tr$>$, $<$td$>$,...
\end{itemize}
Based on these XHTML files with hCalendar microformats, a transformation is performed on the one hand with XSLT and on the other hand with XCerpt. This transformation produces an intermediate format, RDF, which can be queried with SPARQL as well as with XCerpt. Using two different languages for transformation and query may have some disadvantages. Therefore we will use just one language, XCerpt, for transformation as well as query, and point out the advantages of a single language by comparing XCerpt with XSLT and SPARQL.
\subsection{Resulting RDF-Graph}
\begin{figure}
\centering
\includegraphics[width=0.95\textwidth]{D:/University/XCerpt/GRDDL/graph.jpg}
\caption{RDF-Graph}
\label{fig:graph}
\end{figure}
\newpage
\section{XSLT as transformation language}
\subsection{Idea}
There are three possible cases:
\begin{enumerate}
\item Events that have no ancestors that are events \textit{(li-event-top)}
\item Events that contain subevents \textit{(event-in-event and li-event-sub)}
\item Events that are inside another event \textit{(event-last)}
\end{enumerate}
\lstset{language=XSLT}
\lstinputlisting[caption=XSLT Stylesheet]{D:/University/XCerpt/GRDDL/EventimEvent/EventinEvent.xsl}
\subsection{What doesn't work?}
\begin{itemize}
\item The regular expression for duration works, but in some cases two $<$duration$>$$<$$/$duration$>$ tags are produced
\item The URL in the contact tag may not be complete if the source document contains only a relative URL
\item If the user supplies two dtstart tags with the same date, there will be two start dates in the RDF file
\item
It is assumed that a dtstart always exists if there is a date in the source document. This is necessary to create the node ID
\item If there is a dtend, one day should be subtracted from the end day, because the given day is one day after the event ends.
\item There is no regular expression for the ISO 8601 week date format YYYY-Www-D
\item One site used the classes subdetails and eventlist. These could not be found on the hCalendar pages
\item More comments should be added to the stylesheet
\end{itemize}
\subsection{Advantages}
\begin{itemize}
\item The stylesheet is very general. Many different XHTML files with hCalendar microformats can be transformed into RDF with it.
\item \texttt{generate-id()} automatically generates a unique ID for events
\end{itemize}
\subsection{Disadvantages}
\begin{itemize}
\item The code is very long, because the same templates have to be called several times.
\item The stylesheet reflects neither the structure of the source document nor that of the intended result document
\item You often have to remember the current node
\item As a result, there is a lot of code using the descendant and ancestor axes, in almost every template.
\end{itemize}
\subsection{Tested Sites}
\begin{tabular}{|p{7.7cm}|c|c|p{5.3cm}|}
\hline
\textsc{URL} & \textsc{Trans} & \textsc{RDF} & \textsc{Note}\\
\hline \hline
http://jhtc.org/ & NO & & Error reported by XML Parser \\ \hline
http://finetoothcog.com/site/stolen\_bikes & YES & YES & \\ \hline
https://www.urbanbody.com/ information/contact-us & NO & & Source file doesn't exist \\ \hline
http://www.infoiasi.ro/ & YES & YES & seems to work \\ \hline
http://www.crosbyheritage.co.uk/events/ & YES & YES & \\ \hline
http://www.newbury-college.ac.uk/ & YES & YES & \\ \hline
http://07.pagesd.info/ardeche/agenda.aspx & YES & YES & seems to work \\ \hline
http://climbtothestars.org/archives/ 2006/09/14/microformats-et-bloggy-friday-doctobre & NO & & Server returned HTTP response code: 403 for URL \\ \hline
http://www.westmidlandbirdclub.com/diary/ & YES & & Error Detected by XML Parser \\ \hline
http://www.comtec-ars.com/press-releases/ & YES & YES & \\ \hline
http://webdirections.org/program/ & YES & YES & Date sometimes seems to be missing, e.g.\ for Lunch \\ \hline
http://www.thestreet.org.au/whats\_on.htm & NO & NO & Error: Invalid Byte \\ \hline
http://www.clacksweb.org.uk/community/ events/ & YES & YES & uses class="subdetails" \\ \hline
http://www.markthisdate.com/ & NO & NO & Error detected by XML Parser \\ \hline
http://www.gustavus.edu/events/nobel conference2006/schedule.cfm & YES & YES & \\ \hline
http://www.geekinthepark.co.uk/ & NO & NO & Error reported by XML Parser \\ \hline
http://www.besancon.fr/ & NO & NO & Error reported by XML Parser \\ \hline
http://2006.dconstruct.org/schedule/ & YES & YES & \\ \hline
http://www.fuckparade.org/flyer/2006/ & YES & & Works if a different doctype is inserted \\ \hline
http://www.harper-adams.ac.uk/press/events.cfm & NO & & Error Reported by XML Parser \\ \hline
http://www.capital.edu/ & YES & YES & \\ \hline
http://www.thesession.org/events/ & YES & YES & \\ \hline
http://rubyandrails.org/usergroups/newcastle & YES & YES & \\ \hline
http://gross.org.za/calendar & YES & & Works if the link tag is closed and a different doctype is used \\ \hline
http://www.webanalyticsassociation.org/en/ calendarevents/search.asp & NO & & Error reported by XML Parser \\ \hline
\end{tabular}

The XSLT stylesheet seems to work well. Most errors that occur are caused by the source document rather than by the stylesheet. Hence, the authors of the documents need to pay more attention to their markup.
\newpage
\section{XCerpt as transformation language}
The idea is to use XCerpt instead of XSLT. Unfortunately, it is currently not possible to write a general stylesheet, for several reasons that are explained later. As an example, the following source file is to be transformed into RDF.
\lstinputlisting[language=HTML,caption=XCerpt Example]{D:/University/XCerpt/GRDDL/XCerpt/TermineBing.html}
To transform the file shown above into RDF, the following stylesheet is used.
\lstset{language=XML}
\lstinputlisting[caption=XCerpt Example]{D:/University/XCerpt/GRDDL/XCerpt/TermineBing.xcerpt}
\subsection{Advantages}
\begin{itemize}
\item The program stays close to the structure of the source document
\item The code is easier to understand
\item The construction of the RDF is also very close to the intended RDF document
\end{itemize}
\subsection{Disadvantages}
\begin{itemize}
\item A general stylesheet is not possible. This is mainly caused by the surjectivity issue.
\item Problems with events containing other events. At the moment it is only possible to find events that may be part of another event but do not themselves contain any event.
\item If the XCerpt program is very long and contains many terms of the form:
\begin{lstlisting}[caption=Optional]
optional desc /.*/{{
  attributes{class{"location"}},
  var Location
}}
\end{lstlisting}
the time needed to compute the result is very long. Possibly too many optional terms?
\item Namespaces were removed from the XHTML file, because they do not work in XCerpt
\item The title of an event is used as its unique ID. But what should be done if there is no title?
\end{itemize}
\newpage
\section{SPARQL for query}
\subsection{Possible Queries}
SPARQL is used to formulate several queries on the produced RDF files. Some possible queries are:
\begin{enumerate}
\item Which events are attended by X, on what date and at what time?
\item Which events are attended by both X and Y?
\item Is an appointment possible on a specific date at a specific time?
\item Same query as 3,
but with a different start day and end day
\end{enumerate}
\subsection{Queries in SPARQL}
\lstset{language=XML}
\lstinputlisting[caption=SPARQL Query 1]{D:/University/XCerpt/GRDDL/Schedules/SRobert.rq}
\lstinputlisting[caption=SPARQL Query 2]{D:/University/XCerpt/GRDDL/Schedules/Together.rq}
\lstinputlisting[caption=SPARQL Query 3]{D:/University/XCerpt/GRDDL/Schedules/STamara.rq}
\subsection{Advantages}
\begin{itemize}
\item A query with ASK is possible, so that only yes or no is returned as the answer
\item Queries are generally short
\end{itemize}
\subsection{Disadvantages}
\begin{itemize}
\item Executing the query returns the correct results. Nevertheless, some error messages appear on the command line
\item Listing 8: it is not certain whether the program is correct
\item Listing 7: redundancy, almost the same code is used twice.
\item It is confusing to work with the RDF/XML representation while formulating the query on subject-predicate-object triples.
\end{itemize}
\section{XCerpt for query}
As our goal is to use just one language for both transformation and query instead of two, the queries mentioned above can also be formulated in XCerpt. The programs look as follows.
\subsection{Queries in XCerpt}
\lstset{language=XML}
\lstinputlisting[caption=XCerpt Query 1]{D:/University/XCerpt/GRDDL/Schedules/SRobert.xcerpt}
\lstinputlisting[caption=XCerpt Query 2]{D:/University/XCerpt/GRDDL/Schedules/Together.xcerpt}
\begin{lstlisting}[caption=XCerpt Query 3]
This query is unfortunately not possible in XCerpt at the moment,
because there are some problems with the WHERE part.
\end{lstlisting}
\subsection{Advantages}
\begin{itemize}
\item Instead of using FILTER to ensure that var A = var B, in XCerpt it is possible to simply use the same variable.
\end{itemize}
\subsection{Disadvantages}
\begin{itemize}
\item The programs are much longer than the SPARQL queries.
\item All namespaces in the RDF file have been removed.
\end{itemize}
\end{document}