The PSDAD Data Format

PSDAD (Plaintext Self-Describing Assertional Data) is an extensible data serialization format designed to support machine-readable data documents which can be reliably understood by untrained human readers.

This is a sketchy rough draft, trying to flesh out an idea.

## Introduction PSDAD (plaintext self-describing assertional data) is a new [data serialization format](https://en.wikipedia.org/wiki/Serialization). It is significantly different from [existing formats](https://en.wikipedia.org/wiki/Comparison_of_data_serialization_formats), including XML and JSON, in its use of free-form natural language text to delimit fields, identify types, and define semantics. PSDAD documents with an appropriate schema can be understood as natural language text by human readers, while also being easy for machines to serialize and deserialize. Despite appearances, there is no [natural language processing](https://en.wikipedia.org/wiki/Natural_language_processing) (or other "AI") involved, just [regular expressions](https://en.wikipedia.org/wiki/Regular_expression). The design goals, in order of decreasing importance: 1. The design needs to allow evolution-in-place. The initial release will be necessarily have many shortcomings, for instance in internationalization. As such, it must be clear how compatible future releases, possibly with syntactic extensions, can be deployed in practice. 1. Decentralized Extensibility. In a shared medium, like the open web, it should be easy to extend the domain of information being communicated, adding new features, without negatively impacting any currently deployed systems and without needing anyone's permission. Decentralized extensibility has thus far been provided in a few formats by using UUIDs or URIs, as in XML with namespaces or JSON-LD. 1. When using a well-design schema, the meaning of serialized data should be unambiguous, upon careful inspection, to people without technical training, such as members of a jury. This meaning should not depend upon external documents which could be altered to change the meaning of the serialization. This meanings should also be reasonably independent of the context in which the serialization was produced, transmitted, or received, so that it does not dangerously change meaning if the context is lost or misunderstood. 1. Using this serialization should make life easier for software developers, not harder. Of course, not all developers will find the same things easier, but it's clear that JSON has significant advantages on this front over most other serializations. It would be very hard to compete with JSON on the quick-and-dirty ease-of-use front, but perhaps the addition of other features, like documentation and versioning, can provide a system with tangible long-term benefit. 1. The average case performance should be linear, without a decrease in serialization or deserialization speed when handling large datasets or large schemas. Hopefully, the worst-case performance should be no worse than 10x JSON. 1. Serializations should be transmissible without additional encoding over available human text channels, like email, blog posts, microblogs (eg twitter, mastodon), and text messaging, so that machine-readable messages (that is, serializations of data) can use the existing mechanisms for authenticating people and for authorizing distribution of messages along existing social connections. ## Processing Model PSDAD **producers** are systems which contain data in their native data respresentation and in order to send it to other systems, they serialize it using some psdad schema, producing a character string which represents the data. That string may be written to a file or transmitted, usually using utf8. We use the term "string" in the abstract sense of a sequence of characters; it can be streamed and does not need to fit in memory. PSDAD **consumers** are system which read a string using a PSDAD schema and provide a native data representation of the data that was encoded in the string for use in some application. When producers and consumers are using the same schema and the same native data representations, if a producer serializes some data as a string and a consumer then deserializes it, the consumer should end up with a faithful copy of the data the producer started with. More importantly, the schemas and native representations can be different in useful ways and the transmission remains faithful under normal circumstances. In particular, a producer can encode data multiple different ways (typically old "compatibility" versions and new "extended" versions) and a consumer will automatically ignore types of content and versions which are not useful for its application purposes. In some situations, it is practical for the schema to be negotiated. For example, during an HTTP GET of a dataset representation from a server process, the client might include a "Prefer" header or query parameter in which it provides a URL for the schema it will be using to read the data. Given that, the producer can limit its output to the intersections of its schema and the client's, avoiding unnecessary data transfer and processing at both ends. This document specifies how conformant PSDAD Consumers turn strings into an internal data representation for use by applications. It also defines how conformant PSDAD Producers behave, generally by implication as whatever is necessary for the consumer to reproduce the data the consumer was intending to transmit. Processing is defined in three independent stages. Simple applications can be written using only the first stage. 1. data **parsing**: using template declarations in the application data schema, convert the overall psdoc string into a set of records, where each record is a mapping from slot-name to slot-string. 2. data **binding**: using slot declarations in the schema, convert each slot-string into a native data value. This conversion includes handling conventional datatypes like numbers and dates, as well more complicated handling of structured values and references (links) between them. 3. data **validation**: this optional stage uses additional slot declaractions to perform various checks, so that application code can safely assume data values it encounters will be as expected. ### Schemas A PSDAD schema consists of a sequence of templates. Each template consists of a sequence of parts, where each part is either a "literal" or a "slot". Literals are just strings. Slots are entities with various properties, such as a name and type, depending on context. Schemas, templates, and slots may have platform-dependent data attached, such as bindings to relevant application-defined classes, or transformation functions to execute when reading or writing data. As an example, consider a fictional producer of weather data, TempScan. TempScan has a set of weather stations and publishes a feed of automated observations. In our example, they are using the psdad.js library, so they define their schema using calls to its API like this:

const sch = new psdad.Schema()
shc.add(TempReading, 'The temperature at station [station] was [temp]C at time [timestamp].')
sch.add(WindReading, 'At station [station] the windspeed was [speed]k/h at time [timestamp].')

In this example, TempReading and WindReading are application classes. The library uses a notation where each template appears as a concatenation of its literals and slots, where the slot name appears in square brackets. In JSON, one might represent that same schema, without the class names, like this:

[
  [
    "The temperature at station ",
    { "name": "station" },
    " was ",
    { "name": "temp" },
    "C at time ",
    { "name": "timestamp" },
    "."
  ],
  [
    "At station ",
    { "name": "station" },
    " the windspeed was ",
    { "name": "speed" },
    "k/h at time ",
    { "name": "timestamp" },
    "."
  ]
]

The first template in this example schema consists of the following elements, in order: 1. The literal `"The temperature at station "` (note the space at the end of the string) 1. A slot (slot 0, named "station") 1. The literal `" was "` (note the spaces at the beginning and end of the string) 1. A slot (slot 1, named "temp") 1. The literal `" C at time "` 1. A slot (slot 2, named "timestamp") 1. The literal `"."` ISSUE: Define a serialization syntax for schemas, for that above use case where the client tells the server its schema. This would be excellent dogfood. ## Data Parsing We use the term "data parsing" to refer to the process of converting an input string of characters (a PSDAD document) into a sequence of data records, where a data record is an associative array, mapping slot-name to slot-text, along with a field indicating which template matched. PSDAD parsing is dependent on a schema. Unlike with JSON and XML, parsing without a schema is not possible. The schema used in parsing a string does not, however, need to exactly the match the schema used to produce the string. Instead, data should be coded and parsed using whatever schema is appropriate for the application at the time. The parsing rules are: 1. Templates must not contain adjacent slots and must not end with a slot. Parsers MUST reject any schema violating this condition. 1. The quote character (") U+0022 marks the beginning of a "quoted string", which continues until the matching end-quote. Any characters may occur in between, with backslash (\\) U+005C being used to escape any quote characters inside the string, and itself, as in JSON. In the current version, no other escape sequences are allowed and parsers MUST reject strings which use them. 1. All whitespace in the input outside of quoted strings is converted into a single space character ( ) U+0020 before matching. Whitespace inside quoted strings remains unchanged. 1. Template matching is attempted at successive character positions in the input. 1. If multiple templates match at that position, the one which matches the longest string is used. 1. If multiple templates match at a position, and they match the same number of characters of the input string, then the matching template is the one which occurs earliest in the schema. 1. A template matches a substring S of the input text when each of its literals and slots matches consecutive substrings of S, and either the input ends or the next character is a space. 1. Literals match substrings if they are identical strings. 1. A slot matches either a quoted string or the shortest possible substring of the input which enables the literal which follows the slot in its template to match. 1. When a template matches a substring, that text is consumed, and matching proceeds onward. Matches do not overlap. 1. Input text which does not match by these rules is ignored. The slot matching rule is key. It means that the literal after a slot acts as the delimiter of the slot. In emitting PSDAD text, to enable correct parsing, one must emit quoted strings whenever that delimiter occurs in the slot text. Under these rules, psdad strings can be parsed by a [DFA](https://en.wikipedia.org/wiki/Deterministic_finite_automaton), such as used by typical regular-expression engines. Parsers MAY have a "strict mode" which rejects input with non-matching input, rather than ignoring it, for use in contexts where decentralized extensibility is not appropriate. ISSUE: how to negotiate for strict mode? eg strict envelope using U+1 encoding. ### Walkthrough Let's consider parsing some weather data using the schema shown in [Example 1](ex-1-schema-for-tempscan-data-as-psdad-js-code).

This is TempScan dataset 3321.

The temperature at station 7 was 21.2C at time 2019-01-01T11:11:38-05:00.
At station 7 the windspeed was 0.4k/h at time 2019-01-01T11:11:38-05:00.
The temperature at station 9 was 21.2C at time 2019-01-01T11:11:38-05:00.

... TDB if useful ... ... maybe just show the output of --trace ... ### Test Suite A portion of the data-parsing test suite appears here. (Maybe: The suite itself is serialized here using psdad.) Let s1 refer to the following schema: A PSDAD data parser using ... ## Data Binding Issue: Does this need to be part of the psdad spec? Could this just be implementation dependent? This seems to just be psdad.js stuff. It would also be rather different in C++ for instance, or ML. Data binding in parsing is the process of transforming the records produced by the parser into native data structures suitable for application use. Data binding for serialization is the matching process of converting native data structures into records suitable for serialization. Data binding operates at two levels: * Datatype: serializing native data values (which might be arrays or references to objects) and parsing such serializations back into native data values * Object: converting between the record model and the application's native Object model. Both of these are optional. Applications which just work with strings have no need for using datatypes in the data binding layer, and applications which just work with untyped associative arrays (eg `Object` in JavaScript or `dict` in Python) have no need for Object conversion. Datatype binding is done by associating one or more native types with a slot, so the system can convert between what the application is working with and what's transmitted. ISSUE: how to parse apart the type and name and other features? Maybe it's: * [name] * [name, type name-of-type] *** My favorite, I think * [name, one of ("a", "b", "c")] * [type name] [int age] * [name: type] [age: int] [age :int] * [type: name] [int: age] Datatype vs class? In any case it's NATIVE type names. ISSUE: can you write a schmema that works in multiple languages, something like: * [age, js number, c int] That type and constraint information is used: 1. in validation during serialization or parsing 1. in input data binding to convert to suitable native types 1. in output data binding, when native value has multiple interfaces or isn't available for inspeaction 1. to communicate out-of-band to people using your data how you produce it, to make it easier for them to consume it Once you publish an output schema, you should only *add* to it, and mark things as deprecated, but never change or remove. But do we also need XSD datatype information? What if there are two different types with the same serialization? Doesn't happen much. Timestamps without timezeon meaning Z vs local? ### Datatypes Slots contain lexical representations of data values. The mapping from a set of strings to the values represented by each string is called a datatype. For example, using the datatype "number" the string "3" maps to the number three, while using the datatype "string", the string "3" simply maps to itself. Datatypes are *not* used for [data parsing](#Data_Parsing). Instead, they are part of [data binding](#Data_Binding), allowing the application develop to use datatypes suitable for their environment. One implication of this is that template text must be used to encode any important datatype information, rather than reyling on the system to convey this information. This is somewhat like [duck typing](https://en.wikipedia.org/wiki/Duck_typing) in programming languages. For example: * TempScan uses integer mmhg presure measurements: "The pressure at station [] is [:int32]." * QuickWeather reads it with "The pressure at station [] is [:float]." That's fine. No problem. But if: * TempScan writes with "The pressure at station [] is [:float]." * QuickWeather reads it with "The pressure at station [] is [:int32]." then things will be fine until TempScan actually uses a non-integer value, in which case QuickWeather's datatype handler will have to either throw an error or perform a lossy conversion (perhaps by rounding). ISSUE: To avoid this issue, which could be much more serious, the template literals can encode information about datatype: * "The pressure at station [] is [:number] (64bit IEEE 754, like 2.9e+8)" This is an example of where the templates can get cumbersome, as we're embedding the relevant parts of the authoring schema. Actually, this is a bad idea. Don't do it. Because what if you want to change it later? What are you really saying? You're saying in one statement that future statements will be a certain way? Ehhh. ---- tagging data types, eg hex vs decimal, because which is '10'? #### Strings #### Numbers #### References #### Lists ### Special Slots #### id #### subject ## Validation Could go in slots, like

shc.add(TempReading, 'The temperature at station [station] was [temp, type number, min -274, max 300]C at time [timestamp].')

It depends how much value there is in interchange and standardization. Is that better than?

TempReading.def = 'The temperature at station [station] was [temp]C at time [timestamp].')
TempReading.validate.temp = (x) => x > -274 && x < 300
shc.add(TempReading)

## Implementation Status Editor's implementation: [psdad.js](https://github.com/sandhawke/psdad.js) ## Some Issues / Ideas strict parsing, initiated by header This text contains a section which is likely to be misunderstood if not read and understood in sequential order. If any part of it is read without understanding all previous parts, the entire section may be significantly misunderstood. If you understand this, you may read the text which follows, between the start and end markers. It has been modified so that it is less likely to be accidentally read out of order; the modification is that the unicode code points of each character has been incremented; to read it, you must decrement the code points. <text begins here>[text]<text ends here>. ---- Not sure about the whitespace rules. Does it apply inside slot-text? If whitespace inside slot text is transformed, then we'll need to quote any text that happens to include a newline. What if it's two paragraphs? How would that be transmitted?? --- Lists are a datatype. Might be serialized several ways. We have a list. For input and output, each of which gets tried. eg JSON, markdown, and my [1] thing, which works better if we've lost newlines. 123,456 inus vs eu headers? key-to-understanding-0this-document? ref to key in msg? * reading is 123,456 [see notes] * reading is 123.456 [see note-7]