Working with RDF Graph Databases in Python


At BGS, we work with a number of linked datasets which are stored as RDF graph databases. These include our own comprehensive lithology classification, the BGS Rock Classification Scheme, and a simpler international equivalent, the CGI Simple Lithology vocabulary, which BGS helped develop.

This blog post demonstrates how we use Python to parse such RDF data. In this example we traverse parent-child hierarchies within the CGI Simple Lithology scheme to identify all the groups to which a particular lithology belongs. We use this information to simplify colour attribution of lithologies when plotted on a map in our new field data capture tool. We chose to do this work in Python, as the tool was being developed as a Python plugin for QGIS.

How to parse a graph database in Python

To parse a graph database using Python, we are going to use the library rdflib. You can install this library using pip.

pip install rdflib

The file we are going to parse is a Turtle file (.ttl), which is a text format for storing RDF triples. This file comes from the CGI Simple Lithology dataset, which is hosted on GitHub.

Firstly, we have to create an empty graph using rdflib, and then use it to parse the given ttl file.

import rdflib

uri = "https://raw.githubusercontent.com/CGI-IUGS/cgi-vocabs/9dfe161affbe91de4c25622a9c2cfab5aa65c642/vocabularies/geosciml/simplelithology.ttl"
# Create new graph
graph = rdflib.Graph()
# Parse the ttl file from the URI
graph.parse(uri, format="ttl")

Next, we are only interested in the lithology subjects in the graph database, and only in their skos:prefLabel and skos:broader predicates. Therefore, we will define some URIs which we will use to filter our data.

lithology_subject_prefix = "http://resource.geosciml.org/classifier/cgi/lithology/"
skos_pref_label_uri = rdflib.term.URIRef("http://www.w3.org/2004/02/skos/core#prefLabel")
skos_broader_uri = rdflib.term.URIRef("http://www.w3.org/2004/02/skos/core#broader")

We are now ready to start iterating over the triples within the graph, adding the data to a Python dictionary. Each iteration yields a single (subject, predicate, object) triple, so the predicates and objects belonging to one subject are spread across many iterations. Therefore, we need a structure that collects all of the objects for each predicate of each subject.

graph_dict = {}

for subject, predicate, object_ in graph:

    # Filter the triples where the subject URI starts with the lithology URI prefix
    if subject.startswith(lithology_subject_prefix):

        # Ensure the subject has a child dictionary to store predicates
        if subject not in graph_dict:
            graph_dict[subject] = {}

        # Ensure the predicate has a child list to store objects
        # We store the objects as lists because sometimes there is more than one value,
        # e.g. parent lithologies
        if predicate not in graph_dict[subject]:
            graph_dict[subject][predicate] = []

        # Add the new object to the dictionary
        graph_dict[subject][predicate].append(object_)

print(f"Found {len(graph_dict)} lithologies in the graph database!")
Found 265 lithologies in the graph database!
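
The two "ensure it exists" checks in the loop can also be written with collections.defaultdict from the standard library. The sketch below uses plain string tuples as hypothetical stand-ins for rdflib terms, so that it runs without any graph data; the grouping logic is the same.

```python
from collections import defaultdict

# Hypothetical triples as plain strings, standing in for rdflib terms
triples = [
    ("ex:rhyolite", "skos:prefLabel", "rhyolite"),
    ("ex:rhyolite", "skos:broader", "ex:rhyolitoid"),
    ("ex:rhyolite", "skos:altLabel", "liparite"),
    ("ex:rhyolite", "skos:altLabel", "Liparit"),
    ("ex:rhyolitoid", "skos:prefLabel", "rhyolitoid"),
    ("other:thing", "skos:prefLabel", "ignored"),
]
prefix = "ex:"

# Missing subjects and predicates are created automatically on first access
grouped = defaultdict(lambda: defaultdict(list))
for subject, predicate, object_ in triples:
    if subject.startswith(prefix):
        grouped[subject][predicate].append(object_)

print(len(grouped))
print(grouped["ex:rhyolite"]["skos:altLabel"])
```

One thing to keep in mind with this approach is that simply looking up a missing key in a defaultdict creates it, so use the in operator or .get() when you only want to test for presence.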

We now have a dictionary representation of our lithology classifications, built from our graph database! We can use this to look up the data for Rhyolite.

# Import pretty printer to output the dictionary nicely
from pprint import pprint
rhyolite_uri = rdflib.term.URIRef('http://resource.geosciml.org/classifier/cgi/lithology/rhyolite')
pprint(graph_dict[rhyolite_uri])
{rdflib.term.URIRef('http://purl.org/dc/terms/provenance'): [rdflib.term.Literal('LeMaitre et al. 2002', lang='en')],
 rdflib.term.URIRef('http://www.w3.org/1999/02/22-rdf-syntax-ns#type'): [rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#Concept')],
 rdflib.term.URIRef('http://www.w3.org/2000/01/rdf-schema#isDefinedBy'): [rdflib.term.URIRef('http://resource.geosciml.org/classifierscheme/cgi/2016.01/simplelithology')],
 rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#altLabel'): [rdflib.term.Literal('liparita', lang='es'),
                                                                      rdflib.term.Literal('ryolit', lang='sv'),
                                                                      rdflib.term.Literal('Liparit', lang='de'),
                                                                      rdflib.term.Literal('liparite', lang='it'),
                                                                      rdflib.term.Literal('liparit', lang='ru'),
                                                                      rdflib.term.Literal('lipariitti', lang='fi')],
 rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#broader'): [rdflib.term.URIRef('http://resource.geosciml.org/classifier/cgi/lithology/rhyolitoid')],
 rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#definition'): [rdflib.term.Literal('rhyolitoid in which the ratio of plagioclase to total feldspar is between 0.1 and 0.65.', lang='en')],
 rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#exactMatch'): [rdflib.term.URIRef('http://inspire.ec.europa.eu/codelist/LithologyValue/rhyolite')],
 rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#example'): [rdflib.term.Literal('liparite', lang='en'),
                                                                     rdflib.term.Literal('rhyodacite', lang='en')],
 rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#inScheme'): [rdflib.term.URIRef('http://resource.geosciml.org/classifierscheme/cgi/2016.01/simplelithology')],
 rdflib.term.URIRef('http://www.w3.org/2004/02/skos/core#prefLabel'): [rdflib.term.Literal('silarIyU:lIt', lang='km'),
                                                                       rdflib.term.Literal('流纹岩', lang='zh'),
                                                                       rdflib.term.Literal('Rhyolith', lang='de'),
                                                                       rdflib.term.Literal('ryolit', lang='vi'),
                                                                       rdflib.term.Literal('流紋岩', lang='ja'),
                                                                       rdflib.term.Literal('유문암', lang='ko'),
                                                                       rdflib.term.Literal('riolit', lang='id'),
                                                                       rdflib.term.Literal('riolit', lang='ru'),
                                                                       rdflib.term.Literal('ryolit', lang='sv'),
                                                                       rdflib.term.Literal('¹ó\xadÄëºÄìª', lang='lo'),
                                                                       rdflib.term.Literal('rhyolite', lang='en'),
                                                                       rdflib.term.Literal('riolita', lang='es'),
                                                                       rdflib.term.Literal('ryoliitit', lang='fi'),
                                                                       rdflib.term.Literal('หินไรโอไรต์', lang='th'),
                                                                       rdflib.term.Literal('riolit', lang='ms'),
                                                                       rdflib.term.Literal('rhyolite', lang='fr'),
                                                                       rdflib.term.Literal('riolite', lang='it')]}

If we want to see its parents, we can access the object behind the predicate skos:broader. Here we will also use the predicate skos:prefLabel to access the English name for the parents.

broader_uris = graph_dict[rhyolite_uri][skos_broader_uri]
parents = []
for parent_uri in broader_uris:
    parent_labels = graph_dict[parent_uri][skos_pref_label_uri]
    # Get just the English label for each parent
    english_label = [label.toPython() for label in parent_labels if label.language == "en"][0]
    parents.append(english_label)

pprint(parents)
['rhyolitoid']

Finally, in this format of linked data, recursive functions are incredibly useful for traversing a graph from one node to another or, in the case of triples, from one subject to another. Here, we will build the code above into a recursive function to find all of the parent lithology classifications above Rhyolite.

def find_lithology_parents(
    graph_dict,
    target_lithology_uri,
    parent_labels=None,
):
    """
    Find all of the parent lithology classifications above the given target_lithology_uri.
    A set of all parent labels is returned.
    We use a set because this ensures only unique values are stored within it.

    The argument 'parent_labels' can be ignored when calling this function,
    as it carries the set which is populated as the recursion iterates.
    """
    # Start a fresh set for each top-level call. A mutable default such as set()
    # is created once and shared, so results would leak between separate calls.
    if parent_labels is None:
        parent_labels = set()

    # If the target lithology has parents
    if skos_broader_uri in graph_dict[target_lithology_uri]:

        for parent_uri in graph_dict[target_lithology_uri][skos_broader_uri]:
            # Get the current parent English label
            current_parent_labels = graph_dict[parent_uri][skos_pref_label_uri]
            english_label = [label.toPython() for label in current_parent_labels if label.language == "en"][0]

            # Add the current parent to the set of all parents
            parent_labels.add(english_label)

            # Find the parents of the parent; they are added to the same set in place
            find_lithology_parents(graph_dict, parent_uri, parent_labels)

    return parent_labels


rhyolite_parents = find_lithology_parents(graph_dict, rhyolite_uri)
pprint(rhyolite_parents)
{'acidic_igneous_material',
 'acidic_igneous_rock',
 'compound_material',
 'fine_grained_igneous_rock',
 'igneous_material',
 'igneous_rock',
 'rhyolitoid',
 'rock'}

Knowing all possible parents of Rhyolite shows us where it fits in different categorisation schemes, e.g. based on grain size (fine_grained_igneous_rock) or composition (acidic_igneous_rock), and we can choose the most appropriate for a given application.
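
For very deep hierarchies, or data that might accidentally contain a loop, the same traversal can be written without recursion by keeping an explicit stack and a set of visited nodes. The sketch below is one way to do that. The function name find_ancestors and the toy hierarchy are hypothetical, and plain strings stand in for URIs.

```python
def find_ancestors(broader, start):
    """Return every node reachable from 'start' by following 'broader' links."""
    ancestors = set()
    # Begin with the direct parents of the starting node
    stack = list(broader.get(start, []))
    while stack:
        node = stack.pop()
        # Skip nodes already seen, which also protects against cycles
        if node in ancestors or node == start:
            continue
        ancestors.add(node)
        stack.extend(broader.get(node, []))
    return ancestors


# Hypothetical toy hierarchy: child -> list of parents
toy = {
    "rhyolite": ["rhyolitoid"],
    "rhyolitoid": ["acidic_igneous_rock", "fine_grained_igneous_rock"],
    "acidic_igneous_rock": ["igneous_rock"],
    "fine_grained_igneous_rock": ["igneous_rock"],
    "igneous_rock": ["rock"],
}

print(sorted(find_ancestors(toy, "rhyolite")))
```

Because igneous_rock is reachable by two routes here, the visited check also stops it from being expanded twice.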

Alternative Solutions

It is worth noting that there are other ways of extracting data from a triplestore, such as SPARQL, a query language for RDF graph databases with a syntax resembling SQL. You can even use rdflib to run SPARQL queries in Python. More information on this can be found in the rdflib documentation.
