DAFNI-LD, DCAT2 Profile Metadata
Introduction
DAFNI’s Dataset Metadata Standard lies at the heart of our platform, providing our data suppliers, data consumers and data managers with the information required to share, exploit and curate our data assets in the most efficient and effective way possible.
Our Dataset Metadata Standard is built around the DCAT2 specification. This foundation is augmented with additional content, conforming to ISO specifications and community-wide ontologies, in order to build a comprehensive Metadata Standard which is both tailored for the DAFNI platform and interoperable with other standards in the linked data world.
We recognise that over the lifetime of DAFNI, the demands of our community as well as the best practices within data management and the richness of content we need to represent, will all evolve. As such, we have designed our metadata approach to be flexible, allowing us to expand our Standard with new content reflecting the richness, diversity and complexity of the data landscape DAFNI must support.
Dataset Metadata Standard Overview
The DAFNI Metadata Standard provides a comprehensive framework for dataset metadata and includes both mandatory and ‘extended’ optional fields.
Mandatory metadata content
Mandatory content must be supplied before a new dataset can be accepted for inclusion on DAFNI. It represents the minimum amount of information required to administer and exploit a dataset we host. Example fields include the name of the dataset, a description of its content, a point of contact, and its provenance. We use this information:
- To support long-term curation of the dataset – ensuring its quality, relevance and longevity over its lifetime on DAFNI.
- To maximise the potential utilisation of the dataset by our community (based on its access permissions) through both its ‘discoverability’ via DAFNI’s Search & Discovery Services, and subsequent exploitation through our Data Access services.
For a full description of DAFNI’s mandatory metadata expectations please consult the Dataset Metadata Standard detail below.
Extended metadata content
Our extended metadata fields represent a suite of optional information a data supplier may wish to provide that will augment our community’s understanding of their dataset. Provision of this extended metadata supports more effective decision making by a potential data consumer, therefore increasing the quality of research outputs based upon it.
While all extended metadata fields remain optional, we strongly recommend that data suppliers populate as many fields as possible prior to submission.
The full definition of available metadata fields associated with a dataset hosted on DAFNI is available in the Dataset Metadata Standard detail below.
Metadata structure
To aid completion of a metadata entry for a new dataset, and subsequent navigation of the content associated with an existing dataset, we have initially divided our Standard into six thematic sections:
- Dataset-level information describing the properties and characteristics of the overall dataset – its name, a description of its content, primary language, the theme and/or subjects it encapsulates, etc.
- Provenance information describing the origins of the dataset, particularly useful when a dataset has been provided via a third party source (e.g. TfL, ONS, Ordnance Survey, a government department, etc.)
- Legal information pertaining to potential licensing arrangements, access and usage rights, etc.
- Temporal information describing the time range the dataset encapsulates.
- Spatial information describing the geospatial extent of the dataset.
- File-level information detailing properties of each file hosted within the dataset – its name, format (derived from file extension), size etc.
Providing metadata to DAFNI
For users contributing data to the DAFNI platform there are two ways to ensure that the metadata provided is compliant with our Metadata Standard, thus allowing a dataset to benefit from the platform’s full data support capabilities.
Upload data and metadata through the front end
For datasets of a size suitable for HTTP upload (as a guide, file sizes of up to 50GB), we recommend using the platform front end to create the metadata required. Our data upload page contains a form where mandatory and extended metadata fields are clearly indicated. There is also point-and-click support for geospatial domain definition and default options for data licensing.
Upload data via the DAFNI support team
For very large datasets, we recommend contacting the DAFNI team at info@dafni.ac.uk for one-to-one support with upload and metadata completion. Please note that this will be a slower and more complex process than using the front end.
Technical Section
Metadata is as important as the data, and we needed a foundation for a good data architecture in tune with the times. JSON-LD has positioned itself as the standard for linked data serialisation, and large entities like Google and CERN have adopted it. It is coupled with standardised vocabularies, with different specialisations emerging, particularly in the field of ontologies for science. DAFNI is intended to incorporate these, as well as the custom ontologies developed by users as part of their research. We adopted DCATv2 because, while retaining DCAT and DCAT-AP compatibility, it aligns with schema.org. Google recommends the latter but supports both. We decided that supporting GeoJSON was critical, so we adopted GeoJSON-LD 1.0. This is our fundamental approach to supporting advanced geospatial search. These standards have strong momentum and staying power. They form the basis of what Web 3.0 means.
Abbreviations
DCAT is the main vocabulary for describing datasets.
DCAT2 is the latest version of DCAT retaining backward compatibility while aligning with schema.org.
DCAT-AP European-wide dataset description format that is a profile of DCAT.
FAIR Adjective labelling data that meets the principles of findability, accessibility, interoperability and reusability.
FOSS Free and Open Source Software.
GDPR EU law transposed to UK law about individuals retaining control of their own personal information.
FOI Freedom Of Information (Request).
GEMINI A legal requirement to publish geospatial data for public institutions.
INSPIRE European-wide version of GEMINI and its superset, also legally required for describing geospatial datasets.
IRI Internationalised URI. A URI can be a URL or a URN. PIDs must be IRIs.
JSON-LD is a serialisation format based on JSON syntax incorporating linked data abilities.
PID Permanent ID.
Metadata schemas
Our metadata is based on the DCATv2 schema and associated namespaces. Users do not need to keep track of any additions, but may use them if they see fit.
- Our metadata can be harvested by search engines, out of the box.
- Metadata is serialised as JSON-LD 1.1 (media type "application/ld+json", file extension .jsonld; note that "application/json" also applies) and is intended to be backwards compatible with JSON.
- We safely allow GeoJSON and GeoJSON-LD syntax particularities. These fall outside JSON-LD 1.0, so we require version 1.1.
- The main vocabulary is the new DCAT2 standard vocabulary (and associated W3C namespaces, for consistency). This provides futureproofing. Some vocabularies were added in anticipation of new features.
- Interoperation: the chosen vocabularies facilitate metadata conversions to and from international standards, should we ever need them.
- The formats and storage must support other types of metadata standards, like GEMINI. The original metadata must be preserved.
- We avoid custom definitions when an existing standard will do.
- We observe common practices of other entities, including standard-setting organisations, while aiming for simplicity.
- DCAT2 normative namespaces were included.
- GeoJSON is supported via GeoJSON-LD 1.0. It is included to control the implementation.
Advantages for the DAFNI community
- We meet all FAIR guidelines.
- Users make their datasets indexable by search engines at the same time.
- Our solution attains the 5-star linked data rating.
- Maximum specification flexibility while retaining control on types, ranges and meanings.
- DAFNI's internal dataset metadata is designed with flexibility in mind, and will evolve by expansion.
- Backward compatibility while retaining ability to add new features.
- The format will implement schemas from DCAT2 and other W3C associated vocabularies.
- Vocabularies were selected to account for future use beyond MVP.
- We can extend the list of chosen vocabularies (largely generic for now) to meet particular user needs.
- We can use ontologies for science, prioritising established ones.
- We can easily introduce novel custom ontologies potentially developed by the users.
Order of precedence
Some technical choices have to reconcile technologies and practices with our desired policies, so we apply the following order of precedence:
- JSON-LD
- DCAT2
- Established practices of widespread standards, e.g. DCAT-AP, GEMINI, INSPIRE, etc.
- Established practices of large entities, e.g. Google, data.gov, etc.
- Own policy.
DAFNI Metadata Strategy
The DAFNI metadata strategy is based on applying FAIR data principles as the foundation of the platform. To achieve these, a set of de jure or de facto standards and policies has been adopted.
Content Policies
- id and type are aliased for JSON compatibility. We are using the '@' sign to reduce ambiguity throughout the text.
- Aliased, compacted or expanded terms in the vocabulary are all valid and equivalent. We settled on compacted terms as a good balance for MVP.
- Coupled with the above, the context validates as both JSON-LD and JSON. In practice users will add it to their @vocab, or it is implicit.
- Validates in Google's Structured Data tool and the official JSON-LD validator with correct expansions: DAFNI_MPV_metadata_single_jsonld.json
- The dct:spatial value (UK) is correct, but for Nomis datasets the coverage should be England and Wales. It is just an example. It has to be in sync with the geojson. There are many options too, and sometimes one or several are shown.
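The equivalence of compacted and expanded terms can be illustrated with a minimal Python sketch. The prefix map lists the published namespace IRIs for a few of the vocabularies named in this Standard. The helper name `expand_term` is hypothetical and is not part of any DAFNI tooling; a full JSON-LD processor handles many more cases than this.

```python
# Published namespace IRIs for a few of the prefixes used in this document.
PREFIXES = {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "prov": "http://www.w3.org/ns/prov#",
}


def expand_term(term: str, prefixes: dict = PREFIXES) -> str:
    """Expand a compacted term such as 'dct:title' into its full IRI.

    Hypothetical helper for illustration only. Terms with an unknown
    prefix, or with no prefix at all, are returned unchanged.
    """
    prefix, sep, local = term.partition(":")
    if sep and prefix in prefixes:
        return prefixes[prefix] + local
    return term


print(expand_term("dct:title"))     # compacted form used in this Standard
print(expand_term("dcat:keyword"))
```

Both spellings identify the same property, which is why compacted and expanded documents can be treated as equivalent.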
General advice: JSON-LD and vocabularies
- @id and @type aliased as id and type.
- id can be any permanent IRI which in practical terms includes any URI/URL/URN.
- The data we pass to other services could be the contents of the key or a view (ETL or reinterpretation) of it.
- The DCAT2 and other specifications are the ultimate agreement (schema). Refer to the official W3C documentation, which is comprehensive.
- Structures replicate consistently. An object retains its structure regardless of where it is used. E.g. Person, Organization.
- The terms or properties/keys chosen here reflect current practices for semantic data on leading repositories.
- We remain DCATv2 compliant as long as the properties preserve their meaning. Our custom expansions become DCAT profiles, which are backwards compatible.
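As a rough illustration of the id and type aliasing, the sketch below renames the aliased keys back to their JSON-LD keywords. It assumes the aliases are exactly `id` and `type`; the function name `dealias` and the example organisation are hypothetical. A real JSON-LD processor resolves aliases from the context rather than from a hard-coded map.

```python
import json

# Assumed alias mapping, mirroring the policy that id and type stand for @id and @type.
ALIASES = {"id": "@id", "type": "@type"}


def dealias(node):
    """Recursively rename aliased keys to their JSON-LD keywords."""
    if isinstance(node, dict):
        return {ALIASES.get(key, key): dealias(value) for key, value in node.items()}
    if isinstance(node, list):
        return [dealias(item) for item in node]
    return node


publisher = {
    "type": "foaf:Organization",
    "id": "https://example.org/org/1",  # placeholder IRI, not a real PID
    "foaf:name": "Example Organisation",
}
print(json.dumps(dealias(publisher), indent=2))
```

Because the object keeps the same structure wherever it is used, the same function applies to a publisher, a creator or any nested object.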
DAFNI PID policy
Any PID is accepted as long as it follows a stable standard and can ultimately be viewed as pointing to a resource. The idea is to always make it possible to discover further metadata about an object. The full URL is preferred over a URL element, particularly if building the URL is not obvious.
The following are our preferred options, but equivalents could also be accepted. Recommended PIDs as @id:
- ORCiD (Researcher)
- ror.org ID (Research institution)
- DOI (Published research)
- ISSN/ISBN
- SPDX licence ID URIs (for FOSS or anything software related)
- Official licence terms sources like Creative Commons or the Open Government Licence.
- Companies House numbers (URL or ID) for certain entities.
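Since the full URL is preferred over a bare identifier, a small helper can normalise PIDs before they are used as @id. This is a sketch only: the function name `pid_to_url` is hypothetical, and only three of the recommended schemes are covered, using their public resolver addresses.

```python
# Resolver base URLs for three of the recommended PID schemes.
RESOLVERS = {
    "orcid": "https://orcid.org/",
    "ror": "https://ror.org/",
    "doi": "https://doi.org/",
}


def pid_to_url(scheme: str, value: str) -> str:
    """Return the full URL form of a PID; values that are already URLs are kept."""
    value = value.strip()
    if value.startswith(("http://", "https://")):
        return value
    base = RESOLVERS[scheme.lower()]  # raises KeyError for schemes not listed above
    return base + value


# 0000-0002-1825-0097 is the fictitious researcher iD used in ORCID's documentation;
# 10.1000/182 is the DOI of the DOI Handbook.
print(pid_to_url("orcid", "0000-0002-1825-0097"))
print(pid_to_url("doi", "10.1000/182"))
```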
Identifier Policy
- DAFNI RESERVED: Any @id in practice is a PID or can be resolved to one. In the case of dataset metadata it must point to itself.
- DAFNI RESERVED: Datasets: Primary ID in @id.
- Secondary, alternative or external IDs must go in dct:identifier.
- Properties or Objects: If the id is external (IRI).
- Use PIDs (Permanent Identifiers) in @id if they are external.
- Other primary ID should also go into dct:identifier list, URL preferred. Including alternative repository sources.
- A copy of the dataset id goes into the identifier list
- Secondary/alternative ids go into the identifier list, like an official data source (DOI).
- Literal or internal ids go into internalID (adms:identifier).
- De-referenceable IDs like DOI, ORCiD, ror.org go into the object @id.
- External or alternative sources: use the object's owl:sameAs for URIs.
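To make the placement rules more concrete, here is a hedged sketch of how the identifiers might sit in a stored dataset record, with a small check for two of the rules. The function name `check_identifier_policy`, the example.org addresses and the internal reference are all hypothetical; the dataset @id is reserved and would be assigned by DAFNI rather than by the uploader.

```python
def check_identifier_policy(dataset: dict) -> list:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    primary = dataset.get("@id")
    identifiers = dataset.get("dct:identifier", [])
    if not primary:
        problems.append("missing primary @id")
    elif primary not in identifiers:
        problems.append("dataset @id is not copied into dct:identifier")
    for value in identifiers:
        if not isinstance(value, str) or not value:
            problems.append(f"invalid identifier entry: {value!r}")
    return problems


dataset = {
    "@id": "https://example.org/datasets/1234",        # placeholder primary id
    "dct:identifier": [
        "https://example.org/datasets/1234",           # copy of the dataset id
        "https://doi.org/10.1000/182",                 # secondary id, e.g. a DOI
    ],
    "adms:identifier": "internal-ref-42",              # literal or internal id
    "dct:creator": [{
        "@id": "https://orcid.org/0000-0002-1825-0097",  # de-referenceable PID
        "@type": "foaf:Person",
        "foaf:name": "Josiah Carberry",                # fictitious ORCID test researcher
    }],
}
print(check_identifier_policy(dataset))
```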
Process Of Submission
The data and the metadata will be uploaded to a DAFNI temporary area to be later transferred to the NID. Upon upload to the NID, metadata is validated to ensure its structure is correct.
Metadata Quality
The metadata must be valid JSON too and stored as JSON. It must be in UTF-8 encoding, without BOM (Byte Order Mark).
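These requirements can be checked locally before submission. The following is a minimal sketch using only the Python standard library; the function name `load_metadata` is hypothetical and this is not the validation DAFNI itself runs.

```python
import codecs
import json


def load_metadata(raw: bytes) -> dict:
    """Decode and parse a metadata document, rejecting a BOM or invalid input."""
    if raw.startswith(codecs.BOM_UTF8):
        raise ValueError("metadata must not start with a UTF-8 byte order mark")
    text = raw.decode("utf-8")  # raises UnicodeDecodeError if not valid UTF-8
    return json.loads(text)     # raises json.JSONDecodeError if not valid JSON


good = '{"dct:title": "Example dataset"}'.encode("utf-8")
print(load_metadata(good))
```

Both `UnicodeDecodeError` and `json.JSONDecodeError` are subclasses of `ValueError`, so a caller can catch a single exception type for all three failure modes.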
Metadata Validation
- Metadata will be treated as a JSON view for automation, and is subject to a basic validation step to ensure it is valid JSON.
- Metadata is also validated against a JSON Schema file to ensure it conforms to DAFNI standards. The DAFNI metadata schema for upload can be found on the DAFNI GitHub.
- We recommend that users who wish to upload without using HTTP on the web front end should contact the DAFNI team. In time a CLI will be made available for this purpose. Users should check their metadata file using the GitHub schema and a JSON validator.
- Note that this schema is for upload only and metadata available on the DAFNI platform includes fields which are internally generated, and therefore not present in the schema. Fields that are generated internally are listed below. These fields must not be included when uploading metadata through the CLI.
- Dataset id (@id)
- Issued & modified datetimes (dct:issued & dct:modified)
- File-level metadata (dcat:distribution)
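A document exported from the platform could be prepared for re-upload by dropping the internally generated keys listed above. The sketch below assumes those keys appear at the top level with exactly these spellings (plus the aliased form `id`); the function name `strip_generated_fields` is hypothetical.

```python
# Keys listed in this section as internally generated.
# Assumption: "id" is included because @id is aliased as id in this Standard.
GENERATED_KEYS = {"@id", "id", "dct:issued", "dct:modified", "dcat:distribution"}


def strip_generated_fields(metadata: dict) -> dict:
    """Return a copy of the top-level metadata without internally generated keys."""
    return {key: value for key, value in metadata.items() if key not in GENERATED_KEYS}


draft = {
    "@id": "https://example.org/datasets/1234",  # placeholder, generated by the platform
    "dct:title": "Example dataset",
    "dct:issued": "2020-03-01T09:30:00Z",
    "dct:creator": [{"@id": "https://orcid.org/0000-0002-1825-0097", "@type": "foaf:Person"}],
}
upload = strip_generated_fields(draft)
print(sorted(upload))
```

Only top-level keys are removed, so the @id of nested objects such as a creator is preserved.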
The semantic validity aspects are covered by the controlled vocabularies specified in the DAFNI-LD context: meaning, types and ranges.
- All datetime stamps are ISO 8601
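A datetime stamp can be checked with the Python standard library, as sketched below. Note that `datetime.fromisoformat` accepts only a subset of ISO 8601, so this is a conservative check rather than a complete one; the function name `parse_iso8601` is hypothetical.

```python
from datetime import datetime


def parse_iso8601(stamp: str) -> datetime:
    """Parse an ISO 8601 datetime string, raising ValueError if it is malformed."""
    # Python versions before 3.11 do not accept a trailing "Z", so normalise it.
    if stamp.endswith("Z"):
        stamp = stamp[:-1] + "+00:00"
    return datetime.fromisoformat(stamp)


print(parse_iso8601("2020-03-01T09:30:00Z"))
```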
Dataset composition
- Dataset info (landingPage, identifier, language, title, description, keyword, theme, subject, conformsTo)
- Provenance (publisher, creator, qualifiedAttribution)
- Legal (contactPoint, licence, rights)
- Temporal (created, issued, modified, PeriodOfTime)
- Spatial (spatial, geojson)
- File metadata (Distribution)
Dataset properties
compacted property | composition | cardinality | content |
---|---|---|---|
id | property | 1 | RESERVED |
type | property | 1 | dcat:Dataset |
dcat:landingPage | property | 1 | RESERVED |
dct:identifier | property | 1..* | Any permanent ID identifying this dataset. |
dct:language | property | 1 | Language of the dataset. |
dct:title | property | 1 | Short one liner title for the dataset. |
dct:description | property | 1 | Long description of the dataset. |
dcat:keyword | property | 1..* | List of custom keywords/tags (free choice) |
dcat:theme | property | 0..* | List of themes for the dataset, from the INSPIRE theme list. |
dct:subject | property | 1 | Category from the ISO 19115 list. |
dct:conformsTo | object | 1..* | Standards and policies the dcat:Dataset or dcat:Distribution follows. |
dct:publisher | object | 1..* | Publisher of the dataset. |
dct:creator | object | 1..* | Creator or author of the dataset |
prov:qualifiedAttribution | object | 0..* | List of prov:Attribution. Alternative attributions to Person or Organization |
dcat:contactPoint | object | 1..* | Main contact for permissions (GDPR, FOI, rights) |
dct:license | object | 1 | Licence information. |
dct:rights | property | 1 | Permissions and requirements not necessarily stated in the licence. |
dct:created | property | 1 | Creation datetime. |
dct:issued | property | 1 | Publication datetime. |
dct:modified | property | 1 | Update datetime. |
dct:PeriodOfTime | object | 1 | Interval with the temporal coverage of the whole dataset. It can be open. |
dct:accrualPeriodicity | property | 1 | Update frequency |
dct:spatial | object | 1 | Geonames coverage information. |
geojson | object | 1 | Any RFC 7946 GeoJSON object content. |
dcat:Distribution | object | 1..* | List of file metadata objects. |
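The cardinality column can be read as a minimum and maximum number of values. The sketch below checks a handful of properties copied from the table; it is an illustration under that reading, not the schema DAFNI validates against, and the names `CARDINALITY`, `count_values` and `cardinality_errors` are hypothetical.

```python
# Subset of the table above: property -> (minimum, maximum); None means unbounded.
CARDINALITY = {
    "dct:title": (1, 1),
    "dct:description": (1, 1),
    "dcat:keyword": (1, None),
    "dcat:theme": (0, None),
    "dct:subject": (1, 1),
}


def count_values(value) -> int:
    """Treat a missing property as 0 values, a list as its length, anything else as 1."""
    if value is None:
        return 0
    return len(value) if isinstance(value, list) else 1


def cardinality_errors(metadata: dict) -> list:
    """List the properties whose number of values falls outside the allowed range."""
    errors = []
    for name, (low, high) in CARDINALITY.items():
        n = count_values(metadata.get(name))
        if n < low or (high is not None and n > high):
            upper = "*" if high is None else high
            errors.append(f"{name}: found {n}, expected {low}..{upper}")
    return errors


record = {
    "dct:title": "Example dataset",
    "dct:description": "A longer description.",
    "dcat:keyword": [],  # at least one keyword is required
}
print(cardinality_errors(record))
```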
Object properties
conformsTo
property | content |
---|---|
id | URL |
type | dct:Standard |
rdfs:label | Short standard or policy name |
license
property | content |
---|---|
id | URL of the licence |
type | LicenseDocument |
rdfs:label | Short licence name |
creator or agent
property | content |
---|---|
id | URL as permanent ID if possible |
type | foaf:Person or foaf:Organization respectively. |
foaf:name | Full name of the person or the organisation. |
publisher
property | content |
---|---|
id | URL as permanent ID if possible |
type | foaf:Organization . Publisher limited to organisations. |
foaf:name | Full name of the organisation. |
prov:Attribution
property | content |
---|---|
id | URL as permanent ID if possible |
type | prov:Attribution |
dcat:hadRole | ISO 19115-3 role |
prov:agent | foaf:Person or foaf:Organization . Publisher limited to organisations. |
contactPoint
property | content |
---|---|
type | vcard:Individual |
vcard:fn | Full name of the person. |
vcard:hasEmail | email of the person. |
PeriodOfTime
property | content |
---|---|
type | dct:PeriodOfTime |
time:hasBeginning | from datetime. |
time:hasEnd | until datetime. |
spatial
property | content |
---|---|
id | URL of the geonames location |
type | dct:Location |
rdfs:label | Geonames text name of the location. |
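The spatial object names a place, while the geojson property carries the geometry, and the two should describe the same extent. As a hedged sketch, a bounding box can be turned into an RFC 7946 Polygon as follows. The function name `bbox_polygon`, the coordinates and the Geonames address are placeholders, not values taken from DAFNI.

```python
def bbox_polygon(west: float, south: float, east: float, north: float) -> dict:
    """Build a GeoJSON Polygon for a bounding box, in longitude, latitude order."""
    ring = [
        [west, south],
        [east, south],
        [east, north],
        [west, north],
        [west, south],  # a linear ring must close on its first position
    ]
    # The exterior ring is listed counter-clockwise, as RFC 7946 recommends.
    return {"type": "Polygon", "coordinates": [ring]}


spatial = {
    "@id": "https://sws.geonames.org/<geonames-id>/",  # placeholder, not a real identifier
    "@type": "dct:Location",
    "rdfs:label": "Example location",
}
geojson = bbox_polygon(-8.0, 49.0, 2.0, 61.0)  # rough, illustrative extent only
print(geojson)
```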
distribution
property | content |
---|---|
id | RESERVED |
type | dcat:Distribution |
dct:title | Title of the data file |
spdx:fileName | filename as stored |
dcat:downloadURL | RESERVED |
dcat:accessURL | RESERVED |
dcat:byteSize | size in bytes of the file |
dcat:mediaType | IANA MIME type |
csvw:tableSchema | Object containing a list of CSV csvw:Column typed objects (name and dct:description pairs) |
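File-level metadata is generated by the platform, so the sketch below only illustrates the shape such an object might take for a CSV file. The function name `describe_file` is hypothetical, and the key names `csvw:columns` and `csvw:name` are assumptions, since the table above does not spell them out. The media type is derived from the file extension using the standard library.

```python
import mimetypes


def describe_file(file_name: str, content: bytes, column_notes: dict) -> dict:
    """Build a distribution-style dict for one CSV file held in memory."""
    media_type, _ = mimetypes.guess_type(file_name)  # derived from the file extension
    columns = [
        {"@type": "csvw:Column", "csvw:name": name, "dct:description": note}
        for name, note in column_notes.items()
    ]
    return {
        "@type": "dcat:Distribution",
        "dct:title": file_name,
        "spdx:fileName": file_name,
        "dcat:byteSize": len(content),
        "dcat:mediaType": media_type,
        # Assumption: the column list sits under a "csvw:columns" key.
        "csvw:tableSchema": {"csvw:columns": columns},
    }


example = describe_file(
    "trips.csv",
    b"a,b\n1,2\n",
    {"a": "first column", "b": "second column"},
)
print(example["dcat:byteSize"], example["dcat:mediaType"])
```

The reserved properties (id, dcat:downloadURL and dcat:accessURL) are deliberately left out, as they are assigned by the platform.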