TripliSty - an rdf triplifier for statistical data

Early Draft - 2009/07/12

Author:
Giovanni Pirrotta (Department of Mathematics, University of Messina, Italy)

Introduction

The TripliSty tool was created to automate the mapping between statistical data and the SCOVO ontology. To work correctly, TripliSty needs a mapping file written in the SCOMA (SCOvo MApping) ontology definition language, described below.
I will explain the TripliSty tool with an example. In the following table, the values represent the number of tourists in some cities in 2008 (sample data).


In a statistical context, the data to focus on is not the row but the single cell value and its dimensions.
In each statistical table we can have three types of dimensions:

In our example:

The row dimension represents the dimension key that identifies each single row in the table. Different rows in tables will have different row dimensions applied to the cells.
What I want to emphasize is that, in statistical data, the main concept is not the single row but the single cell object.
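
The cell-centric view can be sketched in a few lines of Python. This is only an illustration and not part of TripliSty: the function name table_to_cells is hypothetical, and the figures are the sample values used in the SCOVO example later in this document.

```python
# Sample table: one row per city, one column per genre (year 2008).
TABLE = {
    "rome": {"male": 1324, "female": 1432},
    "paris": {"male": 2432, "female": 2654},
    "london": {"male": 4532, "female": 4943},
    "berlin": {"male": 2354, "female": 2534},
}


def table_to_cells(table, year):
    """Flatten a table into one record per cell, each carrying its dimensions."""
    cells = []
    for city, columns in table.items():
        for genre, value in columns.items():
            cells.append({"year": year, "city": city, "genre": genre, "value": value})
    return cells


cells = table_to_cells(TABLE, "2008")
for cell in cells:
    print(cell)
```

Four rows with two value columns yield eight cells, and each cell keeps the year, city and genre that identify it.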

The SCOVO ontology

A very important ontology for modelling statistical data is, of course, the SCOVO ontology, since all its properties and concepts revolve around the cell object. The following figure shows the cell-centric (or item-centric) view of the SCOVO ontology.


In statistical data each single cell becomes an instance of Item and our task is to assign to each single cell the correct value, using the "value" property, and all dimensions that identify the single cell in the table.

Our above example in SCOVO ontology will be:

 @prefix ex: <http://example.org/tourists/> . 
 @prefix scv: <http://purl.org/NET/scovo#> . 
 @prefix dc: <http://purl.org/dc/elements/1.1/> . 
 @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> . 
 @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> . 
 @prefix xsd: <http://www.w3.org/2001/XMLSchema#> . 
 
# domain schema definitions 
ex:Year    rdfs:subClassOf scv:Dimension ;
 		 dc:title "year" ;
         .

ex:City   rdfs:subClassOf scv:Dimension ; 
        dc:title "A city"; 
        .
		
ex:Genre   rdfs:subClassOf scv:Dimension ; 
         dc:title "genre" ;
		 .
		
ex:2008 rdf:type ex:Year ; 
        dc:title "The year 2008" ; 
        scv:min "2008-01-01"^^xsd:date ; 
        scv:max "2008-12-31"^^xsd:date ;
		. 

ex:rome    rdf:type    ex:City;
           dc:title    "City of Rome".

ex:paris   rdf:type    ex:City;
           dc:title    "City of Paris".

ex:london  rdf:type    ex:City;
           dc:title    "City of London".

ex:berlin  rdf:type    ex:City;
           dc:title    "City of Berlin".



ex:male    rdf:type   ex:Genre;
		     dc:title   "male".

ex:female  rdf:type   ex:Genre;
		     dc:title   "female".

ex:tourists       rdf:type scv:Dataset ;
                  dc:title "Tourists in the 2008" ; 
                  scv:datasetOf ex:2008-rome-m;
                  scv:datasetOf ex:2008-rome-f;
                  scv:datasetOf ex:2008-paris-m;
                  scv:datasetOf ex:2008-paris-f;
                  scv:datasetOf ex:2008-london-m;
                  scv:datasetOf ex:2008-london-f;
                  scv:datasetOf ex:2008-berlin-m;
                  scv:datasetOf ex:2008-berlin-f.
 
			
ex:2008-rome-m   rdf:type      scv:Item ; 
                               rdf:value "1324" ; 
                               scv:dataset ex:tourists ; 
                               scv:dimension ex:rome ; 
                               scv:dimension ex:male ; 
                               scv:dimension ex:2008 ;
							   .

ex:2008-rome-f   rdf:type      scv:Item ; 
                               rdf:value "1432" ; 
                               scv:dataset ex:tourists ; 
                               scv:dimension ex:rome ; 
                               scv:dimension ex:female ; 
                               scv:dimension ex:2008 ;
							   .

ex:2008-paris-m  rdf:type      scv:Item ; 
                               rdf:value "2432" ; 
                               scv:dataset ex:tourists ; 
                               scv:dimension ex:paris ; 
                               scv:dimension ex:male ; 
                               scv:dimension ex:2008 ;
							   .

ex:2008-paris-f  rdf:type      scv:Item ; 
                               rdf:value "2654" ; 
                               scv:dataset ex:tourists ; 
                               scv:dimension ex:paris ; 
                               scv:dimension ex:female ; 
                               scv:dimension ex:2008 ;
							   .

								 
ex:2008-london-m  rdf:type      scv:Item ; 
                                rdf:value "4532" ; 
                                scv:dataset ex:tourists ; 
                                scv:dimension ex:london ; 
                                scv:dimension ex:male ; 
                                scv:dimension ex:2008 ;
								.

ex:2008-london-f  rdf:type      scv:Item ; 
                                rdf:value "4943" ; 
                                scv:dataset ex:tourists ; 
                                scv:dimension ex:london ; 
                                scv:dimension ex:female ; 
                                scv:dimension ex:2008 ;
								.

ex:2008-berlin-m  rdf:type      scv:Item ; 
                                rdf:value "2354" ; 
                                scv:dataset ex:tourists ; 
                                scv:dimension ex:berlin ; 
                                scv:dimension ex:male ; 
                                scv:dimension ex:2008 ;
								.
	
ex:2008-berlin-f  rdf:type      scv:Item ; 
                                rdf:value "2534" ; 
                                scv:dataset ex:tourists ; 
                                scv:dimension ex:berlin ; 
                                scv:dimension ex:female ; 
                                scv:dimension ex:2008 ;
								.

As we can see, each item has a unique URI so that it can be identified inside and outside the table. We can build the URI as we want, but a good way is to assemble pieces taken from the dimensions.
For a moment, let us focus on the female tourists of Berlin. The URI of this item is composed of pieces obtained from the URI home, the dataset and the dimensions.
We can see that in the following figure:

As we can see, each item can be generated simply by assembling the respective information pieces. The goal of this tutorial is to automate this whole procedure with the TripliSty tool.
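
As a rough illustration of this assembling step (a sketch only, not TripliSty code), the eight item URIs of the SCOVO example can be rebuilt from their pieces. The helper name build_item_uri is hypothetical; the base URI is the ex: namespace declared in the example above.

```python
BASE_URI = "http://example.org/tourists/"  # the ex: namespace of the SCOVO example


def build_item_uri(base_uri, pieces):
    """Join the dimension pieces with '-' and append them to the base URI."""
    return base_uri + "-".join(pieces)


item_uris = [
    build_item_uri(BASE_URI, [year, city, genre])
    for year in ["2008"]
    for city in ["rome", "paris", "london", "berlin"]
    for genre in ["m", "f"]
]

for uri in item_uris:
    print(uri)
```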

The TripliSty tool

The TripliSty tool automates the semantic mapping of statistics and provides a set of features to publish the results on-the-fly following the LOD rules. The SCOMA ontology (SCOvo MApping) has been defined to map an input source (database, XLS file, CSV file) to the SCOVO ontology and to define the dimensions to apply to each item.
TripliSty takes a SCOMA mapping file as input and returns the statistics as RDF, as specified in the mapping file.

To understand how TripliSty works, the first step is to learn how to write the SCOMA mapping file.
Imagine the above data located in a database table called "tourists".
First of all we have to decide the structure of our URIs. So we have:

Obviously the "localhost" domain must be replaced with your real domain. Here we have chosen "localhost" in order to publish the final results on our local PC.

Each SCOMA mapping file must have a ScomaConfig object.

# Definition of the class "ScomaConfig"
map:scomacfg   a                    scm:ScomaConfig;
               scm:hasDataset       map:mydataset;
               scm:uriHome          "http://localhost/triplisty_example/";
               scm:uriSchemaHome    "http://localhost/triplisty_example/schema/";
               scm:uriDimensionHome "http://localhost/triplisty_example/dimension/";
               scm:uriDatasetHome   "http://localhost/triplisty_example/dataset/";
               .

A mapping file can define one or more datasets. For simplicity, our example considers only one dataset, referenced by the "mydataset" resource.

# Definition of the class "Dataset"

map:mydataset  a                    scm:Dataset;			      
               scm:sourceType       "db";               
               scm:hasSource        map:mysource;
               scm:hasKey           map:key;               
               scm:hasField         map:field1;               
               scm:hasField         map:field2;
               scm:uriDataset       "http://localhost/triplisty_example/dataset/tourists";
               scm:title            "Dataset: tourists";
               scm:label            "Dataset: tourists";
               scm:hasDimension     map:dim_2008;
			      .

For each dataset we must define a source type, in our case "db" (the other values are "xls" and "csv"). The dataset object defines the structure of our dataset, which is composed of a key and two fields. In our example the key is the city column and the fields are the male and female columns. After defining the dataset URI, title and label, we specify the single dataset-level dimension (the year 2008).
Now we are going to explore how the data source resource is composed.

# Definition of the class "Source"

map:mysource   a                    scm:Database;
               scm:dbType           "mysql";
               scm:dbName           "scoma";
               scm:hostname         "localhost";
               scm:table            "tourists";
               scm:username         "root";
               scm:password         "mypassword";
               .

We must define the database type (default "mysql"; "postgres" is also available), the database name, the hostname, the username, the password and, of course, the table name. We can also define other parameters for more flexibility, for example joins between tables, a size limit, and so forth. See the Scoma Definition Language section.

 

# Definition of the class "Key"

map:key        a                     scm:Key;               
               scm:column            "city";
               scm:hasDimension      map:dim_city;                        
               scm:order             "1";        
               .
			   
			   
map:dim_city   a                          scm:DimensionInstance;
               scm:hasDimensionType       :City;
               scm:uriDimensionInstance   "http://localhost/triplisty_example/dimension/city/@@city@@";
               scm:uriPiece               "-@@city@@";
               scm:index                  "1";
               scm:titlePiece             ", city: @@city@@"; 
               scm:labelPiece             ", city: @@city@@"; 
               .

The Key class is perhaps the most interesting class. Each dataset can have one or more keys, and each key must have an associated column. In our example we use the "city" column of our table. As we said, the key represents our row dimension. Unlike the dataset and column dimensions, which can simply be listed in the SCOMA mapping file, the row dimensions can be very numerous (one for each row). Since we do not want to define every row dimension of the table by hand (there could be thousands), we define a URI pattern that tells TripliSty how to compose each row dimension for us (see the uriDimensionInstance property of the dim_city resource). The same principle applies to the item URI: every item URI contains a piece of the row dimension, so we must also define the item URI piece that changes from row to row, again as a pattern, as you can see in the code.
In the URI pattern we write the changing part between "@@" markers and set an index property to ensure the correct composition in the presence of multiple keys. The title and label must also change from row to row, so we define their text with the same pattern strategy. The order property of the Key resource defines the position of the column in the table.
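
The pattern strategy can be illustrated with a short Python sketch. This is an assumption about how such a substitution could work, not TripliSty's actual implementation; the function name expand_pattern and the sample row are hypothetical, while the pattern strings are the ones from the dim_city resource.

```python
import re

# Matches a column name written between "@@" markers, e.g. @@city@@.
PLACEHOLDER = re.compile(r"@@(\w+)@@")


def expand_pattern(pattern, row):
    """Replace every @@column@@ placeholder with the value of that column in row."""
    return PLACEHOLDER.sub(lambda match: str(row[match.group(1)]), pattern)


row = {"city": "Berlin", "male": 2354, "female": 2534}

dimension_uri = expand_pattern(
    "http://localhost/triplisty_example/dimension/city/@@city@@", row
)
uri_piece = expand_pattern("-@@city@@", row)
title_piece = expand_pattern(", city: @@city@@", row)

print(dimension_uri)
print(uri_piece)
print(title_piece)
```

A string without placeholders, such as "-m", passes through unchanged, so the same helper would also cover static dimensions.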

# Definition of the class "Field"

map:field1     a                    scm:Field;
               scm:hasDimension     map:dim_male;
               scm:datatype         "integer";
               scm:order            "2";
               scm:column           "male";
               .
               

map:field2     a                    scm:Field;
               scm:hasDimension     map:dim_female;
               scm:datatype         "integer";
               scm:order            "3";
               scm:column           "female";
               .
			  			   

map:dim_male   a                         scm:DimensionInstance;
               scm:hasDimensionType      :Genre;
               scm:uriDimensionInstance  "http://localhost/triplisty_example/dimension/genre/male";
               scm:uriPiece              "-m";
               scm:index                 "1";
               scm:titlePiece            ", genre: male";
               scm:labelPiece            ", genre: male";
               .

map:dim_female a                         scm:DimensionInstance;
               scm:hasDimensionType      :Genre;
               scm:uriDimensionInstance  "http://localhost/triplisty_example/dimension/genre/female";
               scm:uriPiece              "-f";
               scm:index                 "1";
               scm:titlePiece            ", genre: female";
               scm:labelPiece            ", genre: female";
               .

map:dim_2008   a                         scm:DimensionInstance;
               scm:hasDimensionType      :Year;
               scm:uriDimensionInstance  "http://localhost/triplisty_example/dimension/year/2008";
               scm:uriPiece              "/2008";
               scm:index                 "1";
               scm:titlePiece            ", year: 2008";
               scm:labelPiece            ", year: 2008";
               .

 

This code shows how to map a column name to one or more dimension resources. For each field you can define an optional datatype for the values in the column, and the order that gives the position of the field in the table (useful when you have more than one dimension in the same column).
For each dimension associated with a column we must define a DimensionInstance resource, following the same rules seen above for the Key class. This time the URI, title and label are static.
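
To make the role of the pieces concrete, here is a hedged Python sketch that combines the dataset, key and field pieces of the mapping above into item URIs and titles for one table row. The concatenation order (dataset dimension, then key, then field) is an assumption, chosen because it reproduces the item URI used in the command-line examples (.../dataset/tourists/2008-Berlin-m); the function describe_row is hypothetical and not part of TripliSty.

```python
DATASET_URI = "http://localhost/triplisty_example/dataset/tourists"
DATASET_TITLE = "Dataset: tourists"

# (uriPiece, titlePiece) pairs copied from the mapping file above.
DATASET_DIM = ("/2008", ", year: 2008")
KEY_DIM = ("-@@city@@", ", city: @@city@@")
FIELD_DIMS = {
    "male": ("-m", ", genre: male"),
    "female": ("-f", ", genre: female"),
}


def describe_row(row):
    """Return {item URI: (title, value)} for every field cell of one table row."""
    key_uri = KEY_DIM[0].replace("@@city@@", row["city"])
    key_title = KEY_DIM[1].replace("@@city@@", row["city"])
    items = {}
    for column, (uri_piece, title_piece) in FIELD_DIMS.items():
        uri = DATASET_URI + DATASET_DIM[0] + key_uri + uri_piece
        title = DATASET_TITLE + DATASET_DIM[1] + key_title + title_piece
        items[uri] = (title, row[column])
    return items


items = describe_row({"city": "Berlin", "male": 2354, "female": 2534})
for uri, (title, value) in items.items():
    print(uri, "|", title, "|", value)
```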

 

# Domain schema definitions
:Year   rdfs:subClassOf scv:Dimension ;
 	    dc:title "Year" . 

:City   rdfs:subClassOf scv:Dimension ; 
        dc:title "A city" . 

:Genre  rdfs:subClassOf scv:Dimension ; 
        dc:title "Genre" .
		   

Finally, we must define the dimension types, simply by extending the SCOVO ontology, as we have seen above.

TripliSty in action

Requirements

Stand-alone mode

# By command-line
python -m triplisty_console -m mapping-file [-r request] [-D] [-f file_to_save]

# To generate the whole table
python -m triplisty_console -m c://triplisty_example/tourists.n3 -r http://localhost/triplisty_example/dataset/tourists

# To generate a single item
python -m triplisty_console -m c://triplisty_example/tourists.n3 -r http://localhost/triplisty_example/dataset/tourists/2008-Berlin-m

# To dump all datasets in mapping file 
python -m triplisty_console -m c://triplisty_example/tourists.n3 -D

# To save the output in a specified file 
python -m triplisty_console -m c://triplisty_example/tourists.n3 -r http://localhost/triplisty_example/dataset/tourists -f myfile.rdf


Publish statistics on the WEB

To publish statistics on the Web with TripliSty, a lod.py python script has been created as follows:

#!/usr/bin/python

import cgi
from triplisty.engine import Engine

print "Content-type: application/rdf+xml"
print

query = cgi.FieldStorage()
uri = query.getvalue("uri")
            
engine =  Engine()
engine.set_mapping_file("c://triplisty_example/tourists.n3")
engine.set_request(uri)
engine.run()

Inside triplisty_example directory an .htaccess file has been prepared as follows:
	

RewriteEngine On
RewriteBase /triplisty_example


RewriteRule ^dataset/(.+) lod.py?uri=http://localhost/triplisty_example/dataset/$1 
RewriteRule ^dimension/(.+) lod.py?uri=http://localhost/triplisty_example/dimension/$1 
RewriteRule ^schema(.*) lod.py?uri=http://localhost/triplisty_example/schema$1 
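
The effect of these three rules can be emulated in Python to check which uri value lod.py would receive for a given request. This is only a sketch: it assumes the patterns are matched against the path relative to the triplisty_example directory, as is usual for rules placed in an .htaccess file, and the function name resolve is hypothetical.

```python
import re

BASE = "http://localhost/triplisty_example/"

# One (pattern, target template) pair per RewriteRule above.
RULES = [
    (re.compile(r"^dataset/(.+)"), BASE + "dataset/{0}"),
    (re.compile(r"^dimension/(.+)"), BASE + "dimension/{0}"),
    (re.compile(r"^schema(.*)"), BASE + "schema{0}"),
]


def resolve(path):
    """Return the uri parameter passed to lod.py for a relative path, or None."""
    for regex, template in RULES:
        match = regex.match(path)
        if match:
            return template.format(match.group(1))
    return None


print(resolve("dataset/tourists/2008-Berlin-m"))
print(resolve("dimension/city/Berlin"))
print(resolve("schema/City"))
```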

That's all !!!

Now you can navigate through your statistical data, for example with the Tabulator Firefox plugin.

The whole table:

 

 

The single Item:

 

A static dimension

A dynamic dimension

A dimension type (dimension class)

All triples are generated on-the-fly, so the data are stored only in the database (or in the XLS/CSV files).

There are many other aspects and features you can specify in the SCOMA mapping file. For example:

Scoma Definition Language

ScomaConfig

A scm:ScomaConfig defines all the datasets to map and the information about the URI schema.

Properties

scm:hasDataset The dataset(s) to map. (Mandatory)
scm:uriHome Defines the URI Home. (Mandatory) Example: http://example.org/myproject
scm:uriSchemaHome Defines the URI Schema Home. (Mandatory) Example: http://example.org/myproject/schema
scm:uriDimensionHome Defines the URI Dimension Home. (Mandatory) Example: http://example.org/myproject/dimension
scm:uriDatasetHome Defines the URI Dataset Home. (Mandatory) Example: http://example.org/myproject/dataset

Example:
# Definition of the class "ScomaConfig"
map:scomacfg   a                    scm:ScomaConfig;               
               scm:hasDataset       map:mydataset;
               scm:uriHome          "http://localhost/triplisty_example/";
               scm:uriSchemaHome    "http://localhost/triplisty_example/schema/";
               scm:uriDimensionHome "http://localhost/triplisty_example/dimension/";
               scm:uriDatasetHome   "http://localhost/triplisty_example/dataset/"; 
               .

Dataset

A scm:Dataset defines the schema of the dataset to map.

Properties

scm:sourceType Defines the input source type. Available values: db, xls, csv. (Mandatory)
scm:hasSource Defines the input source. (Mandatory)
scm:hasKey Defines the schema key(s) column (Mandatory)
scm:hasField Defines the schema field(s) column in the schema. (Mandatory)
scm:uriDataset Defines the Dataset URI (Mandatory)
scm:title Defines the Dataset title (optional)
scm:label Defines the Dataset label (optional)
scm:hasDimension Defines the dataset dimension(s). This dimension will be applied to each dataset item.
scm:sizeDataset Defines a limit for the dataset size. Useful for very large datasets. (To be implemented)
scm:externalTriples Defines the name of the function to invoke in the Item generating procedure. The function must be present in the triplisty.functions module. Use it if you want to generate triples to add on-the-fly to the dataset (e.g. links to external datasets such as GeoNames or DBpedia). See the DimensionInstance example below.

Example:
# Definition of the class "Dataset"

map:mydataset  a                    scm:Dataset;			      
               scm:sourceType       "db";               
               scm:hasSource        map:mysource;
               scm:hasKey           map:key;               
               scm:hasField         map:field1;               
               scm:hasField         map:field2;
               scm:uriDataset       "http://localhost/triplisty_example/dataset/tourists";
               scm:title            "Dataset: tourists";
               scm:label            "Dataset: tourists";
               scm:hasDimension     map:dim_2008;
               .

Database

A scm:Database defines the database input source

Properties

scm:dbType Defines the type of database. Available values: mysql, postgres. (Mandatory)
scm:dbName Defines the database name. (Mandatory)
scm:hostname Defines the hostname (Mandatory)
scm:username Defines the username. (Mandatory)
scm:password Defines the password (Mandatory)
scm:table Defines the table(s) (Mandatory)
scm:condition Defines the condition in the WHERE clause (Optional)
scm:join Defines the join between tables (Optional)
scm:extra Defines an extra string to add in the query. (Example: LIMIT...ORDER BY....DESC...)
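
How these properties could end up in a query is sketched below. This is purely illustrative: the way TripliSty actually builds its SQL is not documented here, so the function build_query, the clause order, and the assumption that scm:join holds a complete JOIN clause are all hypothetical. The values are trusted strings taken from the mapping file, not user input.

```python
def build_query(table, columns, condition=None, join=None, extra=None):
    """Compose a SELECT statement from Database-style mapping properties."""
    parts = ["SELECT " + ", ".join(columns), "FROM " + table]
    if join:
        parts.append(join)                  # assumed to be a full JOIN clause
    if condition:
        parts.append("WHERE " + condition)  # scm:condition
    if extra:
        parts.append(extra)                 # scm:extra, e.g. ORDER BY ... LIMIT ...
    return " ".join(parts)


plain = build_query("tourists", ["city", "male", "female"])
filtered = build_query(
    "tourists",
    ["city", "male", "female"],
    condition="male > 2000",
    extra="ORDER BY city LIMIT 10",
)
print(plain)
print(filtered)
```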

Example:
# Definition of the class "Database"

map:mysource   a                    scm:Database;
               scm:dbType           "mysql";
               scm:dbName           "scoma";
               scm:hostname         "localhost";
               scm:table            "tourists";
               scm:username         "root";
               scm:password         "mypassword";
               .

XLSFile

A scm:XLSFile defines the XLS file input source

Properties

scm:pathfile Defines the path of the file to load. (Mandatory)
scm:indexSheet Defines the index sheet to select into the xls file. (Optional - default: 0)
scm:nameSheet Defines the name sheet to select into the xls file (Optional)
scm:startRow Defines the first row. (Optional - default: 1)
scm:endRow Defines the last row (Optional)

Example:
# Definition of the class "XLSFile"
map:mysource   a                scm:XLSFile;
               scm:pathfile         "C://scoma/table.xls";
               scm:indexSheet       "0"; # default 0
               # or scm:nameSheet   "sheet1"; (alternative to indexSheet)
               scm:startRow         "1"; #default 1
               scm:endRow           "10";
               .	

CSVFile

A scm:CSVFile defines the CSV file input source

Properties

scm:pathfile Defines the path of the file to load. (Mandatory)
scm:fieldsEnclosedBy Defines the fields delimiter (Optional - default: ";")
scm:linesTerminatedBy Defines the line terminator (Optional - default: "\n")

Example:
# Definition of the class "CSVFile"
map:mysource   a                       scm:CSVFile;
               scm:pathfile            "C://scoma/table.csv";
               scm:fieldsEnclosedBy    "\t";   #default ";"
               scm:linesTerminatedBy   "\n";   #default "\n" 
               .	
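
For illustration only, the following Python sketch reads delimited text with the two settings above and labels the columns 1, 2, 3, which is how CSV columns are referenced in scm:column according to the Key and Field examples. The function read_rows and the sample data are hypothetical; TripliSty's own reader may behave differently.

```python
import csv


def read_rows(text, delimiter=";", line_terminator="\n"):
    """Split text into records and map each value to its 1-based column number."""
    lines = [line for line in text.split(line_terminator) if line]
    reader = csv.reader(lines, delimiter=delimiter)
    return [
        {str(position): value for position, value in enumerate(record, start=1)}
        for record in reader
    ]


sample = "Rome\t1324\t1432\nParis\t2432\t2654\n"
rows = read_rows(sample, delimiter="\t", line_terminator="\n")
for record in rows:
    print(record)
```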

Key

A scm:Key defines the key column in the schema table

Properties

scm:column Defines the column to map in the table (Mandatory)
scm:hasDimension Defines the key dimension. This dimension will be applied to each item belonging to the same row. (Mandatory - exactly one)
scm:order Defines the table column position. (Mandatory)
scm:transformFunction Defines the name of the function to apply in the Item generating procedure. The function must be present in the triplisty.functions module. This property is useful when you want to transform the key value (e.g. if the value in the table is "City of Rome", you could specify a function that replaces the spaces with underscores in order to export the value "City_of_Rome"). See the example below.

Example:
# Definition of the class "Key"

map:key        a                      scm:Key;               
               scm:column             "city";    # A,B,C for XLS files, 1,2,3 for CSV files
               scm:hasDimension       map:dim_city;                        
               scm:order              "1";  
               scm:transformFunction  "replace_space"   
               .

For each key, TripliSty generates the item by invoking on-the-fly the replace_space Python function defined in the triplisty.functions module:

def replace_space(self, value):
	return value.strip().replace(" ", "_")  

The value to transform is passed as an argument, so you can convert it as you want.
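
As a quick check of what such a transformation does to a key value, here is a standalone variant of the function (without the self parameter, so that it can be run outside the triplisty.functions module) applied to the "City of Rome" value mentioned above. Prepending "-" mirrors the "-@@city@@" uriPiece pattern and is shown only as an illustration.

```python
def replace_space(value):
    """Trim surrounding whitespace and replace inner spaces with underscores."""
    return value.strip().replace(" ", "_")


key_value = replace_space(" City of Rome ")
uri_piece = "-" + key_value

print(key_value)   # City_of_Rome
print(uri_piece)   # -City_of_Rome
```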

Field

A scm:Field defines the field column in the schema table

Properties

scm:column Defines the column to map in the table (Mandatory)
scm:hasDimension Defines the column dimension(s). This dimension will be applied to the items belonging to the same column. (Mandatory - one or more)
scm:order Defines the table column position. (Mandatory)
scm:datatype Defines the datatype of the cell value (Optional - Available values: integer, float, string, ... See XML Schema)

Example:
# Definition of the class "Field"

map:field1     a                    scm:Field;
               scm:column           "male";  # A,B,C for XLS files, 1,2,3 for CSV files
               scm:hasDimension     map:dim_male;
               scm:datatype         "integer";
               scm:order            "2";
               .

DimensionInstance

A scm:DimensionInstance defines the SCOVO dimension instance properties

Properties

scm:hasDimensionType Defines the dimension instance type (of SCOVO Dimension subclasses) (Mandatory)
scm:uriDimensionInstance Defines the dimension instance URI (static or dynamic URIs) (Mandatory)
scm:uriPiece Defines the URI piece for the item generating procedure. (Mandatory) See example below.
scm:titlePiece Defines the title piece for the item generating procedure. (Optional) See example below.
scm:labelPiece Defines the label piece for the item generating procedure. (Optional) See example below.
scm:index Defines the piece index in the URI. Useful when we have multiple keys or multiple dimensions in the same field. (Mandatory)
scm:externalTriples Defines the name of the function to invoke in the RDF generating procedure. The function must be present in the triplisty.functions module. Use it when you want to generate triples on-the-fly to add to the dataset (e.g. links to external datasets such as GeoNames or DBpedia). See the example below.

In the dimension instance resources we define how TripliSty must build each single Item URI. Dimensions can be static or dynamic.


Example of static dimension

map:dim_male   a                         scm:DimensionInstance;
               scm:hasDimensionType      :Genre;
               scm:uriDimensionInstance  "http://localhost/triplisty_example/dimension/genre/male";
               scm:uriPiece              "-m";
               scm:index                 "1";
               scm:titlePiece            ", genre: male";
               scm:labelPiece            ", genre: male";
               .

All dataset and field dimension instances are examples of static dimensions: the uriDimensionInstance, uriPiece, titlePiece and labelPiece values are fixed. Each dataset dimension instance URI is applied to every Item in the table, and each field dimension instance URI is applied to every Item in the same column. The same holds for uriPiece, titlePiece and labelPiece.


Example of dynamic dimension

map:dim_city   a                          scm:DimensionInstance;
               scm:hasDimensionType       :City;
               scm:uriDimensionInstance   "http://localhost/triplisty_example/dimension/city/@@city@@";
               scm:uriPiece               "-@@city@@";
               scm:index                  "1";
               scm:titlePiece             ", city: @@city@@"; 
               scm:labelPiece             ", city: @@city@@"; 
               scm:externalTriples        "sameas_geonames";
               .

In this case, the key dimension instance URI, the uriPiece, the titlePiece and the labelPiece are applied to each item belonging to the same row. Since we do not want to write every row dimension instance by hand (there could be thousands of rows), we specify a pattern in the URI. The part between @@ markers names the key column, and TripliSty replaces it with the actual key value during item generation. The same holds for uriPiece, titlePiece and labelPiece.

For each dimension instance you can specify also the externalTriples property in order to generate on-the-fly external rdf triples to add to the dimension instance resource. This value represents a function name that must be present in triplisty.functions module. If for example we want to link each city dimension instance with its geonames rdf link we can define the following python function:

 
def sameas_geonames(self, subject, dimension,  value):
    import urllib
    from rdflib import ConjunctiveGraph
    from rdflib import URIRef
        
    value = urllib.quote(value)
    client = ConjunctiveGraph()
    graph = ConjunctiveGraph()		
    client.parse("http://ws.geonames.org/searchRDF?q=" + value + "&maxRows=1")
    city_uri  = list(client.triples((None, Utils.RDF['type'], URIRef('http://www.geonames.org/ontology#Feature'))))[0][0]
    graph.add((URIRef(subject),URIRef("http://www.w3.org/2002/07/owl#sameAs") ,city_uri))
    return graph

The parameters passed to the function are:

In this way, when the city dimension instances are queried, TripliSty adds the new sameAs triple on-the-fly, making it possible to navigate from one dataset to another.

The dimension types

Finally, we must define the dimension types, as in a normal SCOVO ontology.

# Domain schema definitions
:Year   rdfs:subClassOf scv:Dimension ;
 	    dc:title "Year" . 

:City   rdfs:subClassOf scv:Dimension ; 
        dc:title "A city" . 

:Genre  rdfs:subClassOf scv:Dimension ; 
        dc:title "Genre" .
		   

A (very early - refactoring in progress) beta version of TripliSty is available here .
Bug reports, feedback and suggestions are welcome. Write to gpirrotta@unime.it