use regular expressions to resolve literals; URIs, CURIEs #83

VladimirAlexiev · 2022-09-01T15:36:14Z

As a data architect
I want YAML plain values to be recognized by regular expression
So that I don't have to explicitly tag them

The YAML spec https://yaml.org/spec/1.2.2/#332-resolved-tags says
"Application specific tag resolution rules should be restricted to resolving the “?” non-specific tag, most commonly to resolving plain scalars. These may be matched against a set of regular expressions to provide automatic resolution of integers, floats, timestamps and similar types."
The default "JSON schema" https://yaml.org/spec/1.2.2/#102-json-schema includes regexes for matching integers and floats
The extended "Core schema" https://yaml.org/spec/1.2.2/#103-core-schema includes regexes for matching integers, floats, NaNs, null, booleans
The https://yaml.org/type collection lists further interesting cases:
- https://yaml.org/type/timestamp.html for timestamps (such examples used to be in the spec, but are being deprecated)
- https://yaml.org/type/value.html for extensibility of strings with extra attributes
We could define regexes for matching URNs and IRIs instead of tagging them explicitly (see YAML-LD IRI tags #79).
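
To make the spec's mechanism concrete, here is a minimal sketch of regex-based resolution of plain scalars. The patterns are transcribed from the Core schema tag resolution table in YAML 1.2.2; the function name `resolve_plain_scalar` and the table layout are my own illustration, not part of any parser's API.

```python
import re

YAML = "tag:yaml.org,2002:"

# Patterns transcribed from the YAML 1.2.2 Core schema tag resolution table.
# The first pattern that matches the whole scalar wins.
CORE_SCHEMA = [
    (re.compile(r"null|Null|NULL|~|"), YAML + "null"),  # also matches empty
    (re.compile(r"true|True|TRUE|false|False|FALSE"), YAML + "bool"),
    (re.compile(r"[-+]?[0-9]+"), YAML + "int"),
    (re.compile(r"0o[0-7]+"), YAML + "int"),
    (re.compile(r"0x[0-9a-fA-F]+"), YAML + "int"),
    (re.compile(r"[-+]?(\.[0-9]+|[0-9]+(\.[0-9]*)?)([eE][-+]?[0-9]+)?"), YAML + "float"),
    (re.compile(r"[-+]?(\.inf|\.Inf|\.INF)"), YAML + "float"),
    (re.compile(r"\.nan|\.NaN|\.NAN"), YAML + "float"),
]


def resolve_plain_scalar(text):
    """Resolve the "?" non-specific tag of a plain scalar to a concrete tag."""
    for pattern, tag in CORE_SCHEMA:
        if pattern.fullmatch(text):
            return tag
    return YAML + "str"  # everything else stays a string
```

Note that under the Core schema a value such as `2022-09-01` resolves to a plain string, which is why dates and IRIs would need additional patterns.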

Examples (Tagging @OR13 and @mgh128 who work with EPCIS data):

"@context": 
  epcis: https://ns.gs1.org/epcis/
issued: 2022-09-01
stringThatOnlyLooksLikeADate: !string 2022-09-01
homepage: https://example.org/foo
stringThatOnlyLooksLikeAUrl: !string https://example.org/foo
urlThatMayBeMisspelled: !anyURI hxxp:\\i-cannot spell,con/my home page
epcis:readPoint: urn:epc:id:sgln:952005385.011.ts4711
epcis:epcList: https://id.gs1.org/01/70614141123451/21/2018
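
A rough sketch of how the example above could be resolved: an explicit tag always wins, otherwise the first matching regex decides. The three patterns and the names `EXTENDED_RULES`, `EXPLICIT_TAGS` and `resolve` are hypothetical; no spec defines them, and the patterns are deliberately simplistic.

```python
import re

# Hypothetical "extended RDF schema" rules; not taken from any specification.
EXTENDED_RULES = [
    (re.compile(r"\d{4}-\d{2}-\d{2}"), "xsd:date"),
    (re.compile(r"https?://\S+"), "@id"),
    (re.compile(r"urn:[A-Za-z0-9][A-Za-z0-9-]{0,31}:\S+"), "@id"),
]

# Explicit tags used in the example, mapped to the datatype they force.
EXPLICIT_TAGS = {"!string": "xsd:string", "!anyURI": "xsd:anyURI"}


def resolve(value, tag=None):
    """Return the datatype (or "@id") for a scalar; an explicit tag overrides regexes."""
    if tag is not None:
        return EXPLICIT_TAGS[tag]
    for pattern, kind in EXTENDED_RULES:
        if pattern.fullmatch(value):
            return kind
    return "xsd:string"


print(resolve("2022-09-01"))             # untagged date
print(resolve("2022-09-01", "!string"))  # explicit tag wins
```

The misspelled URL only survives because `!anyURI` bypasses the patterns; without the tag it would fall through to `xsd:string`.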

Note: the benefit of datatype xsd:anyURI (tag !anyURI) is that:

- the URL may be misspelled, because it is stored as a literal and semantic repos (at least rdf4j) don't check its syntax. How's that a benefit? Try to import a million CrunchBase "homepage" props and you'll see.
- OWL ontologies frown upon ObjectProperties that don't lead to any triples (eg rdf:type owl:NamedIndividual and some others)

We could also use explicit delimiters eg <...> around URNs (URIs, IRIs), which will also enable the use of CURIEs.
Eg below each of the props epcis:readPoint, epcis:epcList has 2 identical values (first a full URN, then a CURIE),
without having to declare that these are @id properties:

"@context": 
  epcis: https://ns.gs1.org/epcis/
  gtin: https://id.gs1.org/01/
  sgln: "urn:epc:id:sgln:"
epcis:readPoint: 
  - <urn:epc:id:sgln:952005385.011.ts4711>
  - <sgln:952005385.011.ts4711>
epcis:epcList:
  - <https://id.gs1.org/01/70614141123451/21/2018>
  - <gtin:70614141123451/21/2018>
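
As an illustration of the `<...>` idea, the sketch below treats any angle-bracketed value as a node reference and expands it against the prefixes from the context. The function name `resolve_delimited` is made up for this example, and the `//` check mirrors how JSON-LD avoids treating `https://...` as a compact IRI.

```python
CONTEXT = {
    "epcis": "https://ns.gs1.org/epcis/",
    "gtin": "https://id.gs1.org/01/",
    "sgln": "urn:epc:id:sgln:",
}


def resolve_delimited(value, context):
    """Return ("iri", expanded) for <...> values and ("literal", value) otherwise."""
    if not (value.startswith("<") and value.endswith(">")):
        return ("literal", value)
    inner = value[1:-1]
    prefix, sep, local = inner.partition(":")
    # Expand only when the prefix is declared and the rest is not "//authority".
    if sep and prefix in context and not local.startswith("//"):
        return ("iri", context[prefix] + local)
    return ("iri", inner)


# Both spellings of the read point resolve to the same URN.
full = resolve_delimited("<urn:epc:id:sgln:952005385.011.ts4711>", CONTEXT)
curie = resolve_delimited("<sgln:952005385.011.ts4711>", CONTEXT)
print(full == curie)
```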

The CURIE spec says

rules for disambiguation in situations where the same string could be interpreted as either a CURIE or an IRI.

I've seen this once in practice:

geo:lat (prefix for eg the WGS ontology), vs
<geo:1.23,4.56> (point using the geo: URI scheme, so the above prefix if used in a context precludes you from using this scheme)

One way to do this is to require that all CURIEs be expressed as Safe_CURIEs, implying that all unbracketed strings are to be interpreted directly as IRIs.

Safe_CURIEs use [...] delimiters, so we could rewrite the above example as follows:

"@context": 
  epcis: https://ns.gs1.org/epcis/
  gtin: https://id.gs1.org/01/
  sgln: "urn:epc:id:sgln:"
epcis:readPoint: 
  - urn:epc:id:sgln:952005385.011.ts4711
  - [sgln:952005385.011.ts4711]
epcis:epcList:
  - https://id.gs1.org/01/70614141123451/21/2018
  - [gtin:70614141123451/21/2018]

Here we avoid the need for any delimiters in URNs, but the brackets can be confused for "array in flow style":

epcis:epcList: [ https://id.gs1.org/01/70614141123451/21/2018, [gtin:70614141123451/21/2018] ]

We'd need extra spaces around the array brackets, and some damn specialized YAML parsers to grok this.
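
For comparison, a sketch of the Safe_CURIE rule on raw strings: bracketed values are expanded as CURIEs, everything else is taken as an IRI as-is, which removes the `geo:` ambiguity mentioned above. It assumes the raw scalar text is available, which is exactly what a standard YAML parser would not give us for `[...]`. The function name is hypothetical, and the `geo` prefix mapping to the WGS84 vocabulary is only there for the demonstration.

```python
CONTEXT = {
    "geo": "http://www.w3.org/2003/01/geo/wgs84_pos#",  # prefix for the demo
    "sgln": "urn:epc:id:sgln:",
}


def resolve_safe_curie(value, context):
    """Expand [prefix:local]; treat any unbracketed string directly as an IRI."""
    if value.startswith("[") and value.endswith("]"):
        prefix, sep, local = value[1:-1].partition(":")
        if not sep or prefix not in context:
            raise ValueError(f"undeclared prefix in Safe_CURIE: {value}")
        return context[prefix] + local
    return value


print(resolve_safe_curie("[geo:lat]", CONTEXT))      # expanded via the prefix
print(resolve_safe_curie("geo:1.23,4.56", CONTEXT))  # stays a geo: URI
```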

@ioggstream @gkellogg @anatoly-scherbakov

Do you think this is useful or, on the contrary, dangerous?
Should we define an "extended RDF schema" with such regexes, or stay clear of it?
Which variants for expressing URI/CURIE do you like?


VladimirAlexiev · 2022-09-01T15:47:09Z

When a property is @id, its value must be resolved to URN/IRI regardless of any regexes.
I should be able to write:

"@context":
  "@sigil": $
  $base: https://friends.com/
  knows: {$type: $id}
$id: valexiev
name: Vladimir
knows: 
  $id: gkellogg
  name: Gregg

Because writing $id: !id valexiev would be "unbearable".
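
A sketch of what a processor might do with this, assuming the YAML has already been parsed into plain dicts. The `@sigil` keyword is only the proposal from this comment, not an existing JSON-LD feature, and `KEYWORDS`, `swap` and `desigil` are illustrative names. After rewriting `$` keywords to `@` keywords, the `@id` values resolve against `@base` without any tag.

```python
from urllib.parse import urljoin

KEYWORDS = {"base", "context", "id", "type", "vocab"}  # small subset, for the demo


def swap(text, sigil):
    """Turn e.g. "$id" into "@id"; leave anything else untouched."""
    name = text[len(sigil):]
    if text.startswith(sigil) and name in KEYWORDS:
        return "@" + name
    return text


def desigil(node, sigil):
    """Recursively rewrite sigil-prefixed keywords and drop the @sigil entry."""
    if isinstance(node, dict):
        return {swap(key, sigil): desigil(value, sigil)
                for key, value in node.items() if key != "@sigil"}
    if isinstance(node, list):
        return [desigil(item, sigil) for item in node]
    if isinstance(node, str):
        return swap(node, sigil)
    return node


doc = {
    "@context": {"@sigil": "$", "$base": "https://friends.com/",
                 "knows": {"$type": "$id"}},
    "$id": "valexiev",
    "name": "Vladimir",
    "knows": {"$id": "gkellogg", "name": "Gregg"},
}
plain = desigil(doc, doc["@context"]["@sigil"])
base = plain["@context"]["@base"]
subject = urljoin(base, plain["@id"])
friend = urljoin(base, plain["knows"]["@id"])
```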

OR13 · 2022-09-01T19:40:52Z

I like these examples... and, in this context, the use of CURIEs.

anatoly-scherbakov · 2022-09-02T05:26:56Z

I feel concerned about both

- reusing the list syntax,
- and employing a non-standard parser.

But the issue is there.

Perhaps tags like !curie would help?

VladimirAlexiev · 2022-09-05T06:50:13Z

@gkellogg wrote

YAML tags already use an expression for URIs, although possibly not completely compatible, and with implementation issues for things like '#'.

Can you elaborate? Where?

For dealing with values with the proposed node-tag !id, we would need to rely on a description certainly for distinguishing blank nodes, and allow for relative IRIs and compact IRIs. Probably just defer to the paragraph on Node Objects from JSON-LD to have the same value space and semantics as @id:

If the node object contains the @id key, its value MUST be an IRI reference, or a compact IRI (including blank node identifiers). See § 3.3 Node Identifiers, § 4.1.5 Compact IRIs, and § 4.5.1 Identifying Blank Nodes for further discussion on @id values.

I also like treating URIs and CURIEs in a uniform way (so @anatoly-scherbakov , not using [...] for CURIEs).
So @gkellogg is the idea to go like this:

If the context knows that epcis:epcList is an object prop:

epcis:epcList:
  - https://id.gs1.org/01/70614141123451/21/2018
  - gtin:70614141123451/21/2018
# OR
epcis:epcList: [https://id.gs1.org/01/70614141123451/21/2018, gtin:70614141123451/21/2018]

If the context doesn't know, or epcis:epcList is a mixed prop:

epcis:epcList:
  - !id https://id.gs1.org/01/70614141123451/21/2018
  - !id gtin:70614141123451/21/2018
# OR
epcis:epcList: [!id https://id.gs1.org/01/70614141123451/21/2018, !id gtin:70614141123451/21/2018]

ioggstream · 2022-09-05T16:15:55Z

@VladimirAlexiev I didn't know about CURIEs.

Something like the following could probably work, but it would require a specialized namespace.

foo: !curie [gtin:70614141123451/21/2018]

See the security considerations related to tags (https://www.ietf.org/archive/id/draft-ietf-httpapi-yaml-mediatypes-03.html#name-arbitrary-code-execution). They apply in general, not only in this case.

This can be packed with other tag-related features in a specific namespace.

gkellogg · 2022-09-05T20:13:49Z

@gkellogg wrote

YAML tags already use an expression for URIs, although possibly not completely compatible, and with implementation issues for things like '#'.

Can you elaborate? Where?

The %TAG directive maps a c-tag-handle (such as "!xsd!") to an ns-tag-prefix, which can be an ns-uri-char*. Similarly, a node tag such as !<http://www.w3.org/2001/XMLSchema#dateTime> can be a c-verbatim-tag. Both are effectively URIs, with provisions for escaping characters (such as for IRIs). Therefore, node tags interpreted as Literal datatypes, are already in the form of a URI, and when resolved to the Representation Graph, these should be fully resolved on each node.
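
To illustrate, here is a small sketch of how a node tag resolves to a full URI: verbatim tags are taken as-is, and shorthand tags are the handle's prefix concatenated with the suffix. Decoding `%` escapes in the suffix follows the YAML spec; decoding them in the prefix as well is an assumption I'm making to model the '#' escaping some parsers need. `expand_tag` is not a real parser API.

```python
from urllib.parse import unquote

# Default handles defined by YAML itself; %TAG directives add to these.
DEFAULT_HANDLES = {"!": "!", "!!": "tag:yaml.org,2002:"}


def expand_tag(tag, handles=None):
    """Resolve a node tag (verbatim or shorthand) to its full form."""
    table = {**DEFAULT_HANDLES, **(handles or {})}
    if tag.startswith("!<") and tag.endswith(">"):
        return tag[2:-1]  # c-verbatim-tag: already a full URI
    second = tag.find("!", 1)
    if second == -1:
        handle, suffix = "!", tag[1:]        # primary handle, e.g. !id
    else:
        handle, suffix = tag[:second + 1], tag[second + 1:]  # !!int, !xsd!date
    return unquote(table[handle]) + unquote(suffix)


# Equivalent of: %TAG !xsd! http://www.w3.org/2001/XMLSchema%23
handles = {"!xsd!": "http://www.w3.org/2001/XMLSchema%23"}
print(expand_tag("!xsd!dateTime", handles))
```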

For dealing with values with the proposed node-tag !id, we would need to rely on a description certainly for distinguishing blank nodes, and allow for relative IRIs and compact IRIs. Probably just defer to the paragraph on Node Objects from JSON-LD to have the same value space and semantics as @id:

If the node object contains the @id key, its value MUST be an IRI reference, or a compact IRI (including blank node identifiers). See § 3.3 Node Identifiers, § 4.1.5 Compact IRIs, and § 4.5.1 Identifying Blank Nodes for further discussion on @id values.

I also like treating URIs and CURIEs in a uniform way (so @anatoly-scherbakov , not using [...] for CURIEs). So @gkellogg is the idea to go like this:

If the context knows that epcis:epcList is an object prop:
epcis:epcList:
  - https://id.gs1.org/01/70614141123451/21/2018
  - gtin:70614141123451/21/2018
# OR
epcis:epcList: [https://id.gs1.org/01/70614141123451/21/2018, gtin:70614141123451/21/2018]
If the context doesn't know, or epcis:epcList is a mixed prop:
epcis:epcList:
  - !id https://id.gs1.org/01/70614141123451/21/2018
  - !id gtin:70614141123451/21/2018
# OR
epcis:epcList: [!id https://id.gs1.org/01/70614141123451/21/2018, !id gtin:70614141123451/21/2018]

That was basically my thought, although depending on context resolution is a major shortcoming. I've tried to avoid duplicating the context processing done in the JSON-LD Expansion process, so this might be limited to only the top-most context definition, without consideration of scoped contexts. It's definitely a weakness. Relying on specific use of !id would avoid this, but is then little different, in effect, from simply using {"@id": ...}. This may not end up being a profitable line of investigation.

Admittedly, my understanding of YAML is pretty basic, so there may be some details of the YAML syntax which are either incompatible, or not properly exploited in these ideas.

ioggstream · 2022-09-06T07:17:40Z

my understanding of YAML is pretty basic, so there may be some details of the YAML syntax which are either incompatible, or not properly exploited in these ideas

In my experience, it takes quite some time to exploit all the possible ideas. I think that once we "release" the basic profile of yamlld and its media type registration, the json-ld ecosystem will provide enough material to experiment with all those not-yet-standardized ideas in the real world.

We will then have enough experience to address the actual issues that arise, design an extension, and identify best practices.

VladimirAlexiev · 2022-09-19T15:23:38Z

@anatoly-scherbakov I also worry about non-standard YAML parsers (even before this issue), since I haven't seen any parser to properly handle custom tag definitions. As @gkellogg wrote, you can't even use a URI in the tag definition, but have to use the weird tag: scheme.

Did we share the "parser testing" matrix here?
Do we dare to declare conformance features that would require a good YAML parser supporting all features that we adopt?

@gkellogg

YAML tags already use an expression for URIs,

Ok, but you mean resolving tags to URLs. I mean associating regexes with tags so you don't need to use a tag with the value.

CURIEs are a bit of a side topic in this issue:

- JSON-LD already handles CURIEs: if a prop is @id, JSON-LD knows to apply all context prefixes while resolving it.
- Of course, to resolve a CURIE, one needs to declare the respective prefix.
- So I don't think we need to treat CURIEs differently from URIs. Both require:
  - the prop to be declared as @id,
  - or, if it's an unspecified or mixed prop, the value to use an !id tag
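
The rule above can be sketched roughly as follows. It only looks at a single, top-level context (no scoped contexts), and `expand` and `to_term` are illustrative names rather than anything from a JSON-LD library.

```python
def expand(value, context):
    """Expand a CURIE with a declared prefix; leave full IRIs untouched."""
    prefix, sep, local = value.partition(":")
    if sep and not local.startswith("//") and isinstance(context.get(prefix), str):
        return context[prefix] + local
    return value


def to_term(prop, value, tag, context):
    """A value is a node reference if the prop is typed @id or the value is tagged !id."""
    definition = context.get(prop)
    declared_id = isinstance(definition, dict) and definition.get("@type") == "@id"
    if declared_id or tag == "!id":
        return ("iri", expand(value, context))
    return ("literal", value)


UNTYPED = {"epcis": "https://ns.gs1.org/epcis/", "gtin": "https://id.gs1.org/01/"}
TYPED = {**UNTYPED, "epcis:epcList": {"@type": "@id"}}

print(to_term("epcis:epcList", "gtin:70614141123451/21/2018", None, TYPED))
print(to_term("epcis:epcList", "gtin:70614141123451/21/2018", "!id", UNTYPED))
print(to_term("epcis:epcList", "gtin:70614141123451/21/2018", None, UNTYPED))
```

URIs and CURIEs go through the same path here; the only difference is whether a declared prefix matches.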

gkellogg · 2022-09-19T16:17:56Z

@anatoly-scherbakov I also worry about non-standard YAML parsers (even before this issue), since I haven't seen any parser to properly handle custom tag definitions. As @gkellogg wrote, you can't even use a URI in the tag definition, but have to use the weird tag: scheme.

This was only true for one parser I tried written in Perl, but I've lost the reference now. LibYAML, which is fairly widely used, requires an escape of the '#' character, but otherwise seems to parse ASCII-space URIs.

Did we share the "parser testing" matrix here?

Do we dare to declare conformance features that would require a good YAML parser supporting all features that we adopt?

Given the varying state of support for the full spec, it would be good to run some cross-platform tests to identify restrictions on using such features.

@gkellogg

YAML tags already use an expression for URIs,

Ok, but you mean resolving tags to URLs. I mean associating regexes with tags so you don't need to use a tag with the value.

In the extended mode, operating on the Representation Graph, we could probably add additional regular expressions to identify types of literals, for example dates, times, dateTimes, and various number formats, similar to how they are specified in Tag Resolution.

Given legal relative forms, doing so for an IRI is challenging, but the forms defined in RFC 3986/3987 can at least determine if one is considered valid. It may be that it would be too eager, and consider "foo" as being an IRI, as it is a valid path component. Limiting it to full/absolute IRIs would help, but it's still very broad; basically, anything with a ':' could be considered an IRI.
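
A quick sketch of how broad an "absolute IRI" test is. The scheme part follows the RFC 3986 grammar (a letter, then letters, digits, "+", "-" or "."); everything after the colon is only checked for whitespace, which is a simplification of mine, and `looks_absolute` is a made-up name.

```python
import re

# scheme = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." ), then ":" (RFC 3986).
# The rest is loosely approximated as "no whitespace".
ABSOLUTE_IRI = re.compile(r"[A-Za-z][A-Za-z0-9+.-]*:\S*")


def looks_absolute(value):
    """True if the value has the shape scheme:rest, which is a very weak test."""
    return ABSOLUTE_IRI.fullmatch(value) is not None


for candidate in ["https://example.org/foo", "foo", "note:remember", "C:\\temp"]:
    print(candidate, looks_absolute(candidate))
```

Relative forms like "foo" are rejected, but ordinary strings such as "note:remember" or a Windows path are still accepted.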

...

VladimirAlexiev · 2022-11-17T13:29:08Z

Paul Tyson (by email), Nov 17, 2022:
I have a property that can take any type of RDF term as a value.

{
  "@context": {
    "ex": "http://example.org/ns/"
  },
  "ex:thing1": {"ex:foo": 1},
  "ex:thing2": {"ex:foo": "a string"},
  "ex:thing3": {"ex:foo": "http://example.org/yugo"},
  "ex:thing4": {"ex:foo": "2022-11-16T21:04:41"}
}

Is there any way to construct the context to make this come out in RDF like:

_:b0 <http://example.org/ns/thing1> _:b1 .
_:b0 <http://example.org/ns/thing2> _:b2 .
_:b0 <http://example.org/ns/thing3> _:b3 .
_:b0 <http://example.org/ns/thing4> _:b4 .
_:b1 <http://example.org/ns/foo> "1"^^<http://www.w3.org/2001/XMLSchema#integer> .
_:b2 <http://example.org/ns/foo> "a string" .
_:b3 <http://example.org/ns/foo> <http://example.org/yugo> .
_:b4 <http://example.org/ns/foo> "2022-11-16T21:04:41"^^<http://www.w3.org/2001/XMLSchema#dateTime> .

Vladimir: There's no way to do this in JSON-LD.

The first value is a JSON integer, which I think will come out as xsd:integer. But a very large one will probably be cast to xsd:double.
All other values are strings: JSON has no datatypes for URL or datetime.

You're asking to leverage regexes to attach appropriate datatypes to literals. I've only seen this in Perl:

We're discussing similar stuff for YAML-LD, see this issue
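
For what it's worth, the mapping Paul asks for is easy to sketch outside of JSON-LD. The code below is an illustration only: it picks an N-Triples object term for each JSON value using two simplistic patterns of my own (`DATETIME`, `HTTP_IRI`), and `to_ntriples_object` is a hypothetical helper.

```python
import re

XSD = "http://www.w3.org/2001/XMLSchema#"

# Illustrative patterns only; real xsd:dateTime and IRI syntax are richer.
DATETIME = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?(Z|[+-]\d{2}:\d{2})?")
HTTP_IRI = re.compile(r"https?://[^\s<>\"]+")


def to_ntriples_object(value):
    """Pick an RDF term for a JSON value by type first, then by regex."""
    if isinstance(value, bool):  # check before int: bool is a subclass of int
        return f'"{str(value).lower()}"^^<{XSD}boolean>'
    if isinstance(value, int):
        return f'"{value}"^^<{XSD}integer>'
    if isinstance(value, str):
        if DATETIME.fullmatch(value):
            return f'"{value}"^^<{XSD}dateTime>'
        if HTTP_IRI.fullmatch(value):
            return f"<{value}>"
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return f'"{escaped}"'
    raise TypeError(f"unsupported JSON value: {value!r}")


for value in [1, "a string", "http://example.org/yugo", "2022-11-16T21:04:41"]:
    print(to_ntriples_object(value))
```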

OR13 · 2022-11-17T13:41:52Z

I made this a few weekends back:

https://github.com/transmute-industries/jsonld-to-cypher

It has something related internally here:

https://github.com/transmute-industries/jsonld-to-cypher/blob/main/src/utils.js#L30

VladimirAlexiev added the UCR Issue on Use Case/Recommendation label Sep 1, 2022

VladimirAlexiev mentioned this issue Sep 1, 2022

YAML-LD IRI tags #79
