XLWrap – Mapping Design Patterns

Back to main page

Index

Based on feedback from users we are collecting different use cases and sample mappings on this page. This is work-in-progress.

  1. Mapping flat tables
  2. Mapping multi-dimensional cross tables to RDF based on SCOVO
  3. Mapping hierarchies
  4. Mapping a collection of spreadsheets with external cell references
  5. Mapping to DBpedia using DBP_* functions
  6. ...more to come...

Mapping flat tables

Wrapping flat tables is actually easy. A typical use case is to create one RDF instance per row which can be done by a single template with a single row or column shift operation. Consider the following table:

  A B C
1 First name Second name E-Mail address
2 Tom Houston th@ex.com
3 Tim Presley jp@ex.com
4 ... ... ...
{ [] a xl:Mapping ;
    xl:template [
       xl:fileName "file:employees.xls" ;
       xl:templateGraph :Persons ;
       xl:transform [ a xl:RowShift ]
    ] .
}

:Persons {
    [ xl:uri "'http://example.org/' & URLENCODE(A2 & B2)"^^xl:Expr ] a foaf:Person ;
    foaf:name "A2 & ' ' & B2"^^xl:Expr ;
    foaf:mbox_sha1sum "SHA(C2)"^^xl:Expr ;
}

The example above shows only the minimal required mapping elements. The default sheet is the first sheet of the specified spreadsheet. Instead of specifying multiple transform operations in a RDF sequence, it is allowed to specify a single transformation without rdf:Seq. The default transform step of a row or column shift is 1 row/column, the default xl:breakCondition is always false and the default xl:restrict is *.* (see mapping specification section of the documentation).

Range references and custom functions

Range references as part of the template graph can refer to any cell of the sheet and also to other possibly external files and sheets. Additional functions can be easily added as required.

Mapping Multi-dimensional cross tables

In order to wrap multi-dimensional cross tables, you first need to carefully inspect your spreadsheet. In particular, you will need to know:

You should then decide about the structure of the output graph which should best reflect the information model contained in the spreadsheet. The target graph and the way how information is represented in the spreadsheet will determine the mapping specification.

We suggest to first model the desired output graph and then specify the required transformation operations.

1. Modeling the desired output graph

A typical use case is one instance per cell with properties for all the different dimensions. Consider the following simple cross table which represents revenues and sold items of a company per year, country, and product:


Sheet name in workbook: Austria (note that for CSV files there are no sheet names, just one sheet with sheet number 0).

  A B C D E F G H
1 Austria              
2   2007   2008   2009   ...
3 product items sold revenue items sold revenue items sold revenue ...
4 Product 1 342 € 7.866,00 376 € 8.648,00 490 € 11.760,00 ...
5 Product 2 4333 € 1.005.256,00 5655 € 1.328.925,00 3493 € 838.320,00 ...
6 ... ... ... ... ... ... ... ...

A simple RDF model accurately reflecting this information would be a graph with instances like the following:

Simple example output graph:

  ex:revenue_Austria_2007_Product1
    a ex:Revenue ;
    ex:country "Austria" ;
    ex:year "2007" ;
    ex:product "Product x" ;
    ex:revenue "7866"^^<http://www.w3.org/2001/XMLSchema#int> .

For each dimension, you will create a transform operation which shifts parts of the base template graph around in your sheet, workbook, or even over multiple files as we will show later.

But consider you want to map the above table to a RDF graph based on SCOVO, the Statistical Core Vocabulary. Again, we start by designing the target graph. Based on SCOVO, the graph should look like this:

SCOVO output graph:

ex:theDataset a scv:Dataset ;
    dc:title "Revenues of Example, Inc. per year, country, and products" ;
    dc:creator "you..." ;
    dc:date "2009-06-12 19:34:12Z"^^xsd:dateTime .

ex:Revenue a scv:Item .
ex:Country [ rdfs:label "Country" ; rdfs:subClassOf scv:Dimension ] .
ex:Year [ rdfs:label "Year" ; rdfs:subClassOf scv:Dimension ] .
ex:Product [ rdfs:label "Product" ; rdfs:subClassOf scv:Dimension ] .

ex:Revenue_Austria_2007_Product1
    a ex:Revenue ;
    scv:dataset ex:theDataset ;
    scv:dimension ex:Austria ;
    scv:dimension ex:2007 ;
    scv:dimension ex:Product1 ;
    rdf:value "7866"^^<http://www.w3.org/2001/XMLSchema#int> .
ex:Revenue_Austria_2007_Product2
    a ex:Revenue ;
    scv:dataset ex:theDataset ;
    scv:dimension ex:Austria ;
    scv:dimension ex:2007 ;
    scv:dimension ex:Product2 ; # after first row shift
    rdf:value "1005256"^^<http://www.w3.org/2001/XMLSchema#int> .
...
ex:Austria a ex:Country ; rdfs:label "Austria" ; owl:sameAs <http://dbpedia.org/resource/Austria> .
ex:2007 a ex:Year ; rdfs:label "2007" ; owl:sameAs <http://dbpedia.org/resource/2007> .
ex:Product1 a ex:Product ; rdfs:label "Product1" .
ex:Product2 a ex:Product ; rdfs:label "Product2" .
...

In order to model the required meta data, such as the scv:Dataset, the ex:Revenue, and the three SCOVO dimensions, we will need a constant graph (property xl:constantGraph) and one template graph which generates the instances of ex:Revenue, which is a scv:Item.

The table again:

Sheet name in workbook: Austria (note that for CSV files there are no sheet names, just one sheet with sheet number 0).

  A B C D E F G H
1 Austria              
2   2007   2008   2009   ...
3 product items sold revenue items sold revenue items sold revenue ...
4 Product 1 342 € 7.866,00 376 € 8.648,00 490 € 11.760,00 ...
5 Product 2 4333 € 1.005.256,00 5655 € 1.328.925,00 3493 € 838.320,00 ...
6 ... ... ... ... ... ... ... ...

A valid SCOVO mapping could look like this:

SCOVO constant graph part:

<> owl:import scv: .		# import the SCOVO vocabulary

[ xl:uri "MAKEURI('theDataset')"^^xl:Expr ] a scv:Dataset ;
    dc:title "Revenues of Example, Inc. per year, country, and products" ;
    dc:creator "you" ;
    dc:date "NOW()"^^xl:Expr .

ex:Revenue a scv:Item .
ex:Country [ rdfs:label "Country" ; rdfs:subClassOf scv:Dimension ] .
ex:Year [ rdfs:label "Year" ; rdfs:subClassOf scv:Dimension ] .
ex:Product [ rdfs:label "Product" ; rdfs:subClassOf scv:Dimension ] .

SCOVO template graph:

[ xl:uri "MAKEURI('Revenue_' & SHEETNAME(A1) & '_' ∓ B2 & '_' & A4)"^^xl:Expr ] a ex:Revenue ;
    scv:dataset [ xl:uri "MAKEURI('theDataset')"^^xl:Expr ] ;
    scv:dimension [ xl:uri "MAKEURI(SHEETNAME(A1))"^^xl:Expr ; a ex:Country ; rdfs:label "A1"^^xl:Expr ; owl:sameAs "DBP_COUNTRY(A1)"^^xl:Expr ] ;
    scv:dimension [ xl:uri "MAKEURI(B2)"^^xl:Expr ; a ex:Year ; rdfs:label "B2"^^xl:Expr ; owl:sameAs "DBP_YEAR(B2)"^^xl:Expr ] ;
    scv:dimension [ xl:uri "MAKEURI(A4)"^^xl:Expr ; a ex:Product ; rdfs:label "A4"^^xl:Expr ] ;
    rdf:value "B4"^^xl:Expr .
}

As explained in the documentation, it is possible to create URI nodes or to ensure that blank nodes get equal labels in the target graph using xl:uri and xl:id. Because each revenue should be linked with the same SCOVO dimension instance in the output graph, this mechanisms is applied in the example above. Likewise, all revenue instances are linked to the same SCOVO dataset identified by ex:theDataset.

This template graph without any transform operations would be applied just once leading to only one instance of ex:Revenue. In order to apply it multiple times while shifting around parts of the range references, we need to specify transform operations.

Adding a mapping for the sold items

If we also want to map the number of sold items, which has equal dimensions we can do this by simple adding the statement:

	ex:ItemsSold    a scv:Item .

to the constant graph and the following to the template graph:

[ xl:uri "MAKEURI('ItemsSold_' & SHEETNAME(A1) & '_' ∓ B2 & '_' & A4)"^^xl:Expr ] a ex:ItemsSold ;
    scv:dataset [ xl:uri "MAKEURI('theDataset')"^^xl:Expr ] ;
    scv:dimension [ xl:uri "MAKEURI(SHEETNAME(A1))"^^xl:Expr ; a ex:Country ; rdfs:label "A1"^^xl:Expr ; owl:sameAs "DBP_COUNTRY(A1)"^^xl:Expr ] ;
    scv:dimension [ xl:uri "MAKEURI(B2)"^^xl:Expr ; a ex:Year ; rdfs:label "B2"^^xl:Expr ; owl:sameAs "DBP_YEAR(B2)"^^xl:Expr ] ;
    scv:dimension [ xl:uri "MAKEURI(A4)"^^xl:Expr ; a ex:Product ; rdfs:label "A4"^^xl:Expr ] ;
    rdf:value "B4"^^xl:Expr .
}

We will show the complete mapping below.

2. Modeling the transformations

There are often multiple ways how to shift and repeat template graphs. For our example, we first use a row shift to apply the template for all the products. This involves shifting down the range references A4 and C4 as long as the condition LEFT(A4, 7) != 'Product'" holds. Another option would be to specify the number of repeat iterations (xl:repeat "10"). However, since different sub-tables may have different numbers of rows, using conditions is often better.

The second transform is a column shift, which shifts B2 and C4 by 2 columns to the right as long as C4 is not empty.

xl:transform [
    a rdf:Seq ;
    rdf:_1 [
       a xl:RowShift ;
       xl:restriction "A4; C4" ;
       xl:breakCondition "LEFT(A4, 7) != 'Product'" ;
       xl:steps "1" ;
    ] ;
    rdf:_2 [
       a xl:ColShift ;
       xl:restriction "B2; C4"^^xl:Expr ;
       xl:breakCondition "ALLEMPTY(C4)" ;
       xl:steps "2" ;
    ] ;
]

If we had another sheet with similar data, we could add another transform operation to shift and apply the template graph to them:

   	rdf:_3 [
       a xl:SheetShift ;
       xl:restriction "#1.*"^^xl:Expr ;
       xl:repeat "2"
    ] ;

Here is the complete mapping specification of our example:

The mapping file contains an (anonymous) instance of xl:Mapping in the default graph defining the example spreadsheet. There is one map template, which generates a SCOVO graph for the revenue cross table. The base workbook file (xl:fileName) and the sheet number (xl:sheetNumber) are specified first. A constant graph called :SCOVODataset is specified, which generates the necessary SCOVO dataset, SCOVO dimensions, and some DC metadata. It contains XLWrap expressions, as for example NOW(), which returns the current time when the graph is evaluated.

The template graph :SCOVORevenues is repeatedly applied producing multiple SCOVO items for each country, year, and product. In addition to ex:Revenue as explained above, it also contains ex:ItemsSold which refers to B4. There are three transform operations specified as an RDF Sequence:

  1. a 1-step RowShift until LEFT(A4, 7) == 'Product' becomes false,
  2. a 2-step ColShift until !ALLEMPTY(B4:C4) becomes false, and
  3. a 1-step SheetShift where the template is applied two times (one time for the initial sheet included).

The role of the range restrictions (xl:restriction) is very important: it restricts a transform operation to a defined range. For example, given the first RowShift, only range references in A4; B4:C4 will be transformed, but A1 (the country) and B2 (the year) are not transformed. They will be shifted later by the other transform operations.

Example Query

The following query can be used to query the generated SCOVO statistics:

PREFIX rdf: 
PREFIX rdfs: 
PREFIX owl: 
PREFIX scv: 
PREFIX ex: 
SELECT ?country ?year ?product ?type ?value WHERE {
      ?s a ?type ;
      scv:dimension [ a ex:Year ; owl:sameAs ?year ] ;
      scv:dimension [ a ex:Country ; owl:sameAs ?country ] ;
      scv:dimension [ a ex:Product ; rdfs:label ?product ] ;
      rdf:value ?value
} ORDER BY ?country ?year ?product
SPARQL Query Results

Mapping hierarchies

To demonstrate the power of XLWrap we will show another sample use case which generates a RDFS sub-class hierarchy based on a hierarchical information model represented in a flat spreadsheet table. Let's consider the following table data:

  A B C D E
1 Corporate EMEA Company Germany Sales Germany  
2 Corporate EMEA Company Germany Marketing Germany  
3 Corporate Americas Company USA Sales USA  
4 Corporate Americas Company USA Marketing USA  
5 Corporate Shared Services HR    
6 Corporate Shared Services IT    

The table represents a horizontally aligned hierarchy of a company organization. The first column contains the root of the hierarchy and all the subsequent cells from left to right are sub-units of the units in preceding row cells. We would like to generate a RDFS sub-class hierarchy of the units.

We will use two templates:

  1. one for the first column generating all root units (probably just one in fact, but there could be more forming a multi-tree)
  2. one for all the sub-units

The first template is very simple. We apply a single row shift to generate all units row-by-row. Since equal statements will be ignored in the target graph, we don't care about generating the statements for equal units multiple times:

{ [] a xl:Mapping; 
    xl:template [
       xl:fileName "files/testing/hierarchy-example.xls" ;
       xl:templateGraph :FirstCol ;
       xl:transform [
          a xl:RowShift ;
          xl:breakCondition "ALLEMPTY(*.*)"
       ]
    ] .
}

:FirstCol {
    [ xl:uri "URI('http://example.com/' & URLENCODE(A1))"^^xl:Expr ] a rdfs:Class ; rdfs:label "A1"^^xl:Expr .
}

The row shift is applied as long as the referenced cells are not empty (condition ALLEMPTY(*.*)).

The second template is very similar to the first one, it starts with B1 and adds the corresponding rdfs:subClassOf relationship by referring to the previous cell in the same row. Additionally, before the template is applied to the next row using a row shift, a column shift is inserted which generates the hierarchy path as long as the shifted cell B1 is not empty. If the sheet bounds are reached a transform stage is skipped and the processor continues with the next one. But because some hierarchical paths may be shorter than others, it is important to manually check for empty cells using a xl:breakCondition.

{ [] a xl:Mapping;
   ... first template as above ...

    xl:template [
       xl:fileName "files/testing/hierarchy-example.xls" ;
       xl:templateGraph :OtherCols ;
       xl:transform [ a rdf:Seq ;
          rdf:_1 [ a xl:ColShift ; xl:breakCondition "ALLEMPTY(B1)" ] ;
          rdf:_2 [ a xl:RowShift ]
       ]
    ] .

:OtherCols {
    [ xl:uri "URI('http://example.com/' & URLENCODE(B1))"^^xl:Expr ] a rdfs:Class ;
       rdfs:subClassOf "URI('http://example.com/' & URLENCODE(A1))"^^xl:Expr .
}

Here is the generated output in N3.

Mapping a collection of spreadsheets

tbd

Mapping to DBpedia

tdb