Address Pitfalls of Numerical Datatypes in RDF #82

Open
jmkeil opened this issue Mar 8, 2021 · 2 comments

Comments

jmkeil commented Mar 8, 2021

There are a couple of issues with numerical datatypes that make the accurate use of RDF for numerical data error-prone.

The use of xsd:float and xsd:double entails a risk

  • of value distortion in the mapping between lexical space and value space (e.g. "0.1"^^xsd:float is typically mapped to the value 0.1000000014901161), and
  • of numerical issues in the processing (e.g. calculations in SPARQL queries) of the represented values, i.e. underflow errors, overflow errors, rounding errors, cancellation, and error accumulation.
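
Both risks can be reproduced outside of any RDF framework. The following is a minimal Python sketch, not taken from any RDF library: the helper `to_binary32` is a hypothetical name, and it assumes that rounding via Python's binary64 `float` is close enough to a direct lexical-to-binary32 mapping for illustration.

```python
import struct
from decimal import Decimal


def to_binary32(lexical: str) -> float:
    """Round a decimal lexical form to the nearest IEEE 754 binary32 value.

    Assumption: the string is first parsed as binary64 and then rounded to
    binary32, which may differ from direct rounding in rare edge cases.
    """
    return struct.unpack("<f", struct.pack("<f", float(lexical)))[0]


# Decimal(float) converts exactly, so it reveals the value that is really stored.
stored = Decimal(to_binary32("0.1"))
print(stored)  # 0.100000001490116119384765625

# Error accumulation in binary64: adding 0.1 ten times does not yield 1.0.
binary_sum = sum([0.1] * 10)
print(binary_sum == 1.0)  # False

# Decimal arithmetic keeps the stated values exact.
decimal_sum = sum([Decimal("0.1")] * 10)
print(decimal_sum == Decimal("1.0"))  # True
```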

In most cases, xsd:decimal would be a better choice:

  • In particular, I disagree with XML Schema Datatypes in RDF and OWL, W3C Working Group Note 14 March 2006 on the point that xsd:float and xsd:double are the appropriate datatypes for measurements. In my view, this only holds for measurements that originate from binary floating point sources (e.g. numeric calculations or outputs of analog-to-digital converters). Other measurements typically consist of a value and the measurement uncertainty of the measurement device used, resulting in a representation by two precise values, which should both be represented with xsd:decimal.
  • Another exception is cases where a representation of infinity is required, which is only provided by xsd:float and xsd:double.

The use of xsd:decimal for value representation does not considerably impede the use of floating point arithmetic for calculations (e.g. for performance reasons), as the conversion is trivial. The other direction is harder: if rounding of the lexical representation must be avoided, xsd:float and xsd:double values would require custom lexical mappings that do not conform to the standard, are (depending on the framework) probably cumbersome to implement, and are not always possible (e.g. inside SPARQL queries).

However, I don't see awareness of these issues in general, and especially not in teaching material.

Further, RDF unnecessarily inherits limitations from XSD: exponential notation is supported only for xsd:float and xsd:double, not for xsd:decimal (and derived datatypes). It was not included in xsd:decimal because the requirement was already met by the precisionDecimal datatype, which, however, did not become a built-in datatype in RDF. This tempts users to use xsd:double even where it is not appropriate. The shorthand syntax in Turtle, TriG and SPARQL amplifies this, as xsd:double might be used even if not intended.
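
To illustrate how the shorthand syntax selects a datatype, here is a small Python sketch based on the INTEGER, DECIMAL and DOUBLE productions of the Turtle grammar. The function name `shorthand_datatype` is hypothetical, and the sketch only classifies a single token; it is not a Turtle parser.

```python
import re

# Numeric literal productions as given in the Turtle grammar.
EXPONENT = r"[eE][+-]?[0-9]+"
INTEGER = re.compile(r"[+-]?[0-9]+")
DECIMAL = re.compile(r"[+-]?[0-9]*\.[0-9]+")
DOUBLE = re.compile(
    rf"[+-]?(?:[0-9]+\.[0-9]*{EXPONENT}|\.[0-9]+{EXPONENT}|[0-9]+{EXPONENT})"
)


def shorthand_datatype(token: str) -> str:
    """Return the datatype that an unquoted Turtle numeric token is given."""
    if DOUBLE.fullmatch(token):
        return "xsd:double"
    if DECIMAL.fullmatch(token):
        return "xsd:decimal"
    if INTEGER.fullmatch(token):
        return "xsd:integer"
    raise ValueError(f"not a numeric shorthand literal: {token!r}")


for token in ["1", "1.0", "1e0", "1.5E3"]:
    print(token, "->", shorthand_datatype(token))
```

Merely writing an exponent is enough to switch the datatype from xsd:decimal to xsd:double, whether or not binary floating point semantics were intended.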

(A more detailed discussion of the issues can be found in arXiv:2011.08077 and some reviewer comments on it.)

Possible Actions

I think the following actions would help to ease the accurate representation of numbers in RDF:

  1. Enable exponential notation for xsd:decimal (and derived datatypes) in RDF.
  2. Emphasize in teaching material the risk of numerical issues and the only partial coverage of the lexical space by the value space of xsd:float/xsd:double, which results in rounded values after the lexical mapping.
  3. Enable tools to suggest xsd:decimal instead of xsd:float and xsd:double, and to warn users if a lexical xsd:float or xsd:double value was entered that would require rounding during the lexical mapping.
  4. Maybe change Turtle, TriG and SPARQL syntax to use exponential notation as shorthand syntax for xsd:decimal instead of xsd:double.
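
The warning mentioned in action 3 could be based on a check like the following Python sketch. The names `nearest_binary` and `requires_rounding` are hypothetical, and the sketch assumes finite, in-range decimal lexical forms; special values such as INF and NaN, and values that overflow binary32, are not handled.

```python
import struct
from decimal import Decimal


def nearest_binary(lexical: str, datatype: str) -> float:
    """Map a lexical form to the binary value a typical implementation stores."""
    value = float(lexical)  # nearest IEEE 754 binary64 value
    if datatype == "xsd:float":
        # Round further to binary32 (assumption: rounding via binary64).
        value = struct.unpack("<f", struct.pack("<f", value))[0]
    return value


def requires_rounding(lexical: str, datatype: str) -> bool:
    """True if the stated decimal number differs from the stored binary value."""
    # Decimal(str) is the exact stated number, Decimal(float) the exact stored one.
    return Decimal(lexical) != Decimal(nearest_binary(lexical, datatype))


print(requires_rounding("0.1", "xsd:float"))    # True
print(requires_rounding("0.5", "xsd:float"))    # False
print(requires_rounding("0.1", "xsd:double"))   # True
print(requires_rounding("0.25", "xsd:double"))  # False
```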

Actions 1 to 3 would not cause any backward compatibility problems. Action 4, however, would obviously cause backward compatibility problems in software, but might at the same time increase the accuracy of value representations in existing RDF documents without changing them.

Further, one could think about adding mandatory support for precisionDecimal (to have an arbitrary precision datatype with a representation of infinity), but that is a new feature and goes beyond making RDF easier.

jmkeil commented Feb 20, 2023

To make this issue more actionable, here are some more details, some thoughts about requirements, and a solution sketch.


Problem

  1. For the datatypes xsd:float and xsd:double, multiple lexical representations get mapped to the same value by rounding. For example, "0.1"^^xsd:float gets mapped to 0.100 000 001 4.... This misleads data curators into believing they state precise numbers when they actually state slightly different values.
  2. xsd:float and xsd:double force compliant implementations either to use floating point arithmetic or to use rounded input values in calculations with arbitrary-precision decimal arithmetic. xsd:decimal forces fully compliant implementations to use arbitrary-precision decimal arithmetic, and forces minimally conforming implementations to preserve a precision of at least 16 digits (one more than double precision floating point arithmetic guarantees). Even popular implementations (e.g. Virtuoso) fail to comply with this. The precision actually needed for a calculation depends on the application problem, not on the data used. However, RDF requires data curators to make this decision. Currently, RDF restricts the choice of the arithmetic that is reasonable for a problem, which might make compliant implementations less efficient, harder or impossible to write (e.g. due to hardware capabilities, response time constraints and language/library support), or less precise than required.
  3. Syntactic sugar in JSON-LD, Turtle, TriG and SPARQL, as well as the missing support in xsd:decimal for infinite values, NaN (see e.g. OM issue 57) and exponential notation, tempt data curators to use xsd:float and xsd:double and thereby to distort the stated values.

For a more detailed description of the problem refer to The Problem with XSD Binary Floating Point Datatypes in RDF (talk recording).

Requirements

A couple of requirements follow from these problems:

  1. Avoid partial coverage of lexical spaces by value spaces to avoid ambiguity and to not fool data curators.
  2. Do not restrict the choice of an arithmetic with the data.
  3. Permit exponential notation for arbitrary-precision numbers.
  4. Existing data can be used by new software.
  5. Existing distorted data get fixed.
  6. Enable an explicit binary representation of IEEE 754 binary32 (float) or IEEE 754 binary64 (double) values that cannot be misinterpreted as decimal numbers.
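
One way to picture requirement 6 is to serialize the raw bit pattern in hexadecimal, so that no decimal reading is possible. The Python sketch below is only an illustration: `float_to_hex` and `hex_to_float` are hypothetical helpers, and the big-endian `0x…` form is an assumption. A binary32 value occupies 32 bits (8 hexadecimal digits), a binary64 value 64 bits (16 hexadecimal digits).

```python
import struct

# struct format codes for big-endian IEEE 754 binary32 and binary64.
FORMATS = {32: ">f", 64: ">d"}


def float_to_hex(value: float, bits: int) -> str:
    """Serialize a float as the hexadecimal form of its IEEE 754 bit pattern."""
    return "0x" + struct.pack(FORMATS[bits], value).hex()


def hex_to_float(lexical: str, bits: int) -> float:
    """Parse a hexadecimal bit pattern back into the float it encodes."""
    if not lexical.startswith("0x"):
        raise ValueError("expected a 0x prefix")
    raw = bytes.fromhex(lexical[2:])  # accepts lower and upper case digits
    return struct.unpack(FORMATS[bits], raw)[0]


print(float_to_hex(1.0, 32))  # 0x3f800000
print(float_to_hex(0.1, 32))  # 0x3dcccccd
print(float_to_hex(0.1, 64))  # 0x3fb999999999999a
print(hex_to_float("0x3FB999999999999A", 64) == 0.1)  # True
```

Each bit pattern maps to exactly one value, so such a lexical form cannot silently stand for a slightly different number.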

Solution Draft

As a basis for discussion I would like to propose the following (challenging/maybe unrealistic) list of changes to address the problem:

  1. Add exponential notation to the lexical space of xsd:decimal.
  2. Add NaN, -Inf, Inf, and +Inf to the lexical space and value space of xsd:decimal.
  3. Relax the minimal 16 digits constraint for xsd:decimal on minimally conforming implementations.
  4. Add datatype …:HexFloat with lexical space 0x00000000 to 0xffffffff/0xFFFFFFFF (8 hexadecimal digits for 32 bits) and value space of IEEE 754 binary32.
  5. Add datatype …:HexDouble with lexical space 0x0000000000000000 to 0xffffffffffffffff/0xFFFFFFFFFFFFFFFF (16 hexadecimal digits for 64 bits) and value space of IEEE 754 binary64.
  6. Interpret non integer numbers in JSON-LD as xsd:decimal instead of xsd:double.
    • Permitted according to ECMA-404.

    • Permitted according to RFC8259:

      This specification allows implementations to set limits on the range
      and precision of numbers accepted. Since software that implements
      IEEE 754 binary64 (double precision) numbers [IEEE754] is generally
      available and widely used, good interoperability can be achieved by
      implementations that expect no more precision or range than these
      provide, in the sense that implementations will approximate JSON
      numbers within the expected precision. A JSON number such as 1E400
      or 3.141592653589793238462643383279 may indicate potential
      interoperability problems, since it suggests that the software that
      created it expects receiving software to have greater capabilities
      for numeric magnitude and precision than is widely available.

      Summarized: Expect non IEEE 754 binary64 values to get approximated.

    • Possible due to points 1, 2 and 3.

  7. Interpret numbers in exponential notation in Turtle as xsd:decimal instead of xsd:double. Possible due to point 1.
  8. Interpret numbers in exponential notation in TriG as xsd:decimal instead of xsd:double. Possible due to point 1.
  9. Interpret numbers in exponential notation in SPARQL as xsd:decimal instead of xsd:double. Possible due to point 1.
  10. Interpret explicitly typed xsd:float and xsd:double literals as xsd:decimal. Possible due to points 1 and 2.
  11. Deprecate xsd:float and xsd:double. Possible due to points 6 to 10.
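
The effect of reading non-integer JSON numbers as decimals rather than doubles (point 6) can be sketched with Python's standard `json` module, whose `parse_float` hook receives the original number string. This is plain JSON parsing and not a JSON-LD processor; the sample document and its keys are made up for illustration, reusing the two numbers from the RFC 8259 quote.

```python
import json
from decimal import Decimal

doc = (
    '{"exact": 0.1, "big": 1E400, '
    '"pi": 3.141592653589793238462643383279, "count": 3}'
)

# Default behaviour: non-integer numbers become binary64 floats.
as_double = json.loads(doc)
print(as_double["big"])  # inf
print(as_double["pi"])   # 3.141592653589793

# parse_float=Decimal keeps every stated digit, including exponential notation.
as_decimal = json.loads(doc, parse_float=Decimal)
print(as_decimal["big"])    # 1E+400
print(as_decimal["pi"])     # 3.141592653589793238462643383279
print(as_decimal["exact"])  # 0.1
```

Integers such as `"count"` are unaffected, because `parse_float` is only applied to numbers with a fraction or an exponent.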

Compatibility Considerations

Old implementations with new data:

  • might fail to parse xsd:decimal literals with exponential notation
  • might fail to parse xsd:decimal literals with NaN, -Inf, Inf, or +Inf
  • cannot parse …:HexFloat literals
  • cannot parse …:HexDouble literals

New implementations with old data:

  • value of xsd:float and xsd:double literals might slightly change
    • in most cases, this removes value distortion / improves data quality

Old implementations interacting with new/upgraded implementations:

  • might fail due to xsd:float or xsd:double literals in a SPARQL query result that turn into xsd:decimal literals

This would of course not be the easiest change to the RDF standards, especially as it also touches the XML standards. But I think it is important to address this to make RDF a reliable framework for the representation of numeric data. What do you think about it? (e.g. @afs, @VladimirAlexiev, @gkellogg, @namedgraph)

@namedgraph

@danbri might have an opinion :)
