PSDAD (Plaintext Self-Describing Assertional Data) is an extensible data serialization format designed to support machine-readable data documents which can be reliably understood by untrained human readers.

    const sch = new psdad.Schema()
    sch.add(TempReading, 'The temperature at station [station] was [temp]C at time [timestamp].')
    sch.add(WindReading, 'At station [station] the windspeed was [speed]k/h at time [timestamp].')

In this example, `TempReading` and `WindReading` are application classes. The library uses a notation in which each template is written as the concatenation of its literals and slots, with each slot name in square brackets. In JSON, one might represent the same schema, without the class names, like this:

    [
      [ "The temperature at station ", { "name": "station" }, " was ", { "name": "temp" }, "C at time ", { "name": "timestamp" }, "." ],
      [ "At station ", { "name": "station" }, " the windspeed was ", { "name": "speed" }, "k/h at time ", { "name": "timestamp" }, "." ]
    ]

The first template in this example schema consists of the following elements, in order:

1. The literal `"The temperature at station "` (note the space at the end of the string)
1. A slot (slot 0, named "station")
1. The literal `" was "` (note the spaces at the beginning and end of the string)
1. A slot (slot 1, named "temp")
1. The literal `"C at time "` (note the space at the end of the string)
1. A slot (slot 2, named "timestamp")
1. The literal `"."`

ISSUE: Define a serialization syntax for schemas, for the use case above where the client tells the server its schema. This would be excellent dogfood.

## Data Parsing

We use the term "data parsing" to refer to the process of converting an input string of characters (a PSDAD document) into a sequence of data records, where a data record is an associative array, mapping slot-name to slot-text, along with a field indicating which template matched.

PSDAD parsing is dependent on a schema. Unlike with JSON and XML, parsing without a schema is not possible. The schema used in parsing a string does not, however, need to exactly match the schema used to produce the string. Instead, data should be coded and parsed using whatever schema is appropriate for the application at the time.

The parsing rules are:

1. Templates must not contain adjacent slots and must not end with a slot. Parsers MUST reject any schema violating this condition.
1. The quote character (") U+0022 marks the beginning of a "quoted string", which continues until the matching end-quote. Any characters may occur in between, with backslash (\\) U+005C being used to escape any quote characters inside the string, and itself, as in JSON.
   In the current version, no other escape sequences are allowed and parsers MUST reject strings which use them.
1. All whitespace in the input outside of quoted strings is converted into a single space character ( ) U+0020 before matching. Whitespace inside quoted strings remains unchanged.
1. Template matching is attempted at successive character positions in the input.
1. If multiple templates match at a position, the one which matches the longest string is used.
1. If multiple templates match at a position, and they match the same number of characters of the input string, then the matching template is the one which occurs earliest in the schema.
1. A template matches a substring S of the input text when each of its literals and slots matches consecutive substrings of S, and either the input ends or the next character is a space.
1. Literals match substrings if they are identical strings.
1. A slot matches either a quoted string or the shortest possible substring of the input which enables the literal which follows the slot in its template to match.
1. When a template matches a substring, that text is consumed, and matching proceeds onward. Matches do not overlap.
1. Input text which does not match by these rules is ignored.

The slot matching rule is key. It means that the literal after a slot acts as the delimiter of the slot. When emitting PSDAD text, to enable correct parsing, one must emit a quoted string whenever that delimiter occurs in the slot text.

Under these rules, PSDAD strings can be parsed by a [DFA](https://en.wikipedia.org/wiki/Deterministic_finite_automaton), such as those used by typical regular-expression engines.

Parsers MAY have a "strict mode" which rejects input containing non-matching text, rather than ignoring it, for use in contexts where decentralized extensibility is not appropriate.

ISSUE: how to negotiate for strict mode? e.g. strict envelope using U+1 encoding.
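As a rough illustration, the rules above could be sketched in JavaScript along the following lines. This is not the psdad.js implementation: the function names (`compile`, `normalize`, `readQuoted`, `matchAt`, `parse`) and the record shape `{ template, slots }` are assumptions made for this example. The sketch also assumes that each run of whitespace collapses to one space, that an unquoted slot may be empty, and that a quoted string with an unsupported escape simply fails to match instead of raising an error. It finds the end of each slot by searching for the next literal, rather than building a DFA.

```javascript
// Illustrative sketch only; helper names are hypothetical, not the psdad.js API.

// Split a template into literal and slot parts, enforcing rule 1.
function compile (template) {
  const parts = []
  let last = 0
  for (const m of template.matchAll(/\[([^\]]*)\]/g)) {
    if (m.index > last) parts.push({ lit: template.slice(last, m.index) })
    else if (parts.length > 0) throw new Error('adjacent slots: ' + template)
    parts.push({ slot: m[1] })
    last = m.index + m[0].length
  }
  if (parts.length > 0 && last === template.length) throw new Error('ends with a slot: ' + template)
  if (last < template.length) parts.push({ lit: template.slice(last) })
  return parts
}

// Rule 3: keep quoted strings verbatim, turn other whitespace runs into one space.
function normalize (input) {
  return input.replace(/"(?:[^"\\]|\\[\s\S])*"|\s+/g, m => (m[0] === '"' ? m : ' '))
}

// Rule 2: read a quoted string starting at text[pos] === '"'.
// Returns null if it is unterminated or uses an escape other than \" or \\.
function readQuoted (text, pos) {
  let value = ''
  for (let i = pos + 1; i < text.length; i++) {
    if (text[i] === '"') return { value, end: i + 1 }
    if (text[i] === '\\' && text[i + 1] !== '"' && text[i + 1] !== '\\') return null
    if (text[i] === '\\') i++
    value += text[i]
  }
  return null
}

// Rules 7-9: try one compiled template at position pos.
function matchAt (parts, text, pos) {
  const slots = {}
  let i = pos
  for (let p = 0; p < parts.length; p++) {
    const part = parts[p]
    if (part.lit !== undefined) {
      if (!text.startsWith(part.lit, i)) return null
      i += part.lit.length
    } else if (text[i] === '"') {
      const quoted = readQuoted(text, i)
      if (!quoted) return null
      slots[part.slot] = quoted.value
      i = quoted.end
    } else {
      // Rule 1 guarantees a literal follows; it delimits the slot text.
      const end = text.indexOf(parts[p + 1].lit, i)
      if (end < 0) return null
      slots[part.slot] = text.slice(i, end)
      i = end
    }
  }
  if (i < text.length && text[i] !== ' ') return null
  return { slots, end: i }
}

// Rules 4-6, 10, 11: scan the input, preferring the longest match and then
// the earliest template; unmatched text is skipped.
function parse (templates, input) {
  const compiled = templates.map(compile)
  const text = normalize(input)
  const records = []
  let pos = 0
  while (pos < text.length) {
    let best = null
    compiled.forEach((parts, template) => {
      const m = matchAt(parts, text, pos)
      if (m && m.end > pos && (!best || m.end > best.end)) best = { template, ...m }
    })
    if (best) records.push({ template: best.template, slots: best.slots })
    pos = best ? best.end : pos + 1
  }
  return records
}

const records = parse([
  'The temperature at station [station] was [temp]C at time [timestamp].',
  'At station [station] the windspeed was [speed]k/h at time [timestamp].'
], `This is TempScan dataset 3321.
  The temperature at station 7 was 21.2C at time 2019-01-01T11:11:38-05:00.
  At station 7 the windspeed was 0.4k/h at time 2019-01-01T11:11:38-05:00.`)
console.log(JSON.stringify(records, null, 2))
```

With the two TempScan templates, the call at the end yields two records, one per reading. The opening sentence "This is TempScan dataset 3321." matches no template and is skipped, as the last rule requires.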
### Walkthrough

Let's consider parsing some weather data using the schema shown in [Example 1](ex-1-schema-for-tempscan-data-as-psdad-js-code).

    This is TempScan dataset 3321.
    The temperature at station 7 was 21.2C at time 2019-01-01T11:11:38-05:00.
    At station 7 the windspeed was 0.4k/h at time 2019-01-01T11:11:38-05:00.
    The temperature at station 9 was 21.2C at time 2019-01-01T11:11:38-05:00.

... TBD if useful ...

... maybe just show the output of --trace ...

### Test Suite

A portion of the data-parsing test suite appears here. (Maybe: The suite itself is serialized here using PSDAD.)

Let s1 refer to the following schema:

A PSDAD data parser using ...

## Data Binding

ISSUE: Does this need to be part of the PSDAD spec? Could this just be implementation dependent? This seems to just be psdad.js stuff. It would also be rather different in C++, for instance, or ML.

Data binding in parsing is the process of transforming the records produced by the parser into native data structures suitable for application use. Data binding for serialization is the matching process of converting native data structures into records suitable for serialization.

Data binding operates at two levels:

* Datatype: serializing native data values (which might be arrays or references to objects) and parsing such serializations back into native data values
* Object: converting between the record model and the application's native Object model.

Both of these are optional. Applications which just work with strings have no need for datatypes in the data binding layer, and applications which just work with untyped associative arrays (e.g. `Object` in JavaScript or `dict` in Python) have no need for Object conversion.

Datatype binding is done by associating one or more native types with a slot, so the system can convert between what the application is working with and what's transmitted.

ISSUE: how to parse apart the type and name and other features?
Maybe it's:

* `[name]`
* `[name, type name-of-type]` (my favorite, I think)
* `[name, one of ("a", "b", "c")]`
* `[type name]`, as in `[int age]`
* `[name: type]`, as in `[age: int]` or `[age :int]`
* `[type: name]`, as in `[int: age]`

Datatype vs class? In any case it's NATIVE type names.

ISSUE: can you write a schema that works in multiple languages, something like:

* `[age, js number, c int]`

That type and constraint information is used:

1. in validation during serialization or parsing
1. in input data binding, to convert to suitable native types
1. in output data binding, when the native value has multiple interfaces or isn't available for inspection
1. to communicate out-of-band to people using your data how you produce it, to make it easier for them to consume it

Once you publish an output schema, you should only *add* to it, and mark things as deprecated, but never change or remove anything.

But do we also need XSD datatype information? What if there are two different types with the same serialization? Doesn't happen much. Timestamps without timezone meaning Z vs local?

### Datatypes

Slots contain lexical representations of data values. The mapping from a set of strings to the values represented by each string is called a datatype. For example, using the datatype "number" the string "3" maps to the number three, while using the datatype "string", the string "3" simply maps to itself.

Datatypes are *not* used for [data parsing](#Data_Parsing). Instead, they are part of [data binding](#Data_Binding), allowing the application developer to use datatypes suitable for their environment.

One implication of this is that template text must be used to encode any important datatype information, rather than relying on the system to convey this information. This is somewhat like [duck typing](https://en.wikipedia.org/wiki/Duck_typing) in programming languages.

For example:

* TempScan uses integer mmHg pressure measurements: "The pressure at station [] is [:int32]."
* QuickWeather reads it with "The pressure at station [] is [:float]."

That's fine. No problem. But if:

* TempScan writes with "The pressure at station [] is [:float]."
* QuickWeather reads it with "The pressure at station [] is [:int32]."

then things will be fine until TempScan actually uses a non-integer value, in which case QuickWeather's datatype handler will have to either throw an error or perform a lossy conversion (perhaps by rounding).

ISSUE: To avoid this issue, which could be much more serious, the template literals can encode information about datatype:

* "The pressure at station [] is [:number] (64bit IEEE 754, like 2.9e+8)"

This is an example of where the templates can get cumbersome, as we're embedding the relevant parts of the authoring schema.

Actually, this is a bad idea. Don't do it. Because what if you want to change it later? What are you really saying? You're saying in one statement that future statements will be a certain way? Ehhh.

----

tagging data types, e.g. hex vs decimal, because which is '10'?

#### Strings

#### Numbers

#### References

#### Lists

### Special Slots

#### id

#### subject

## Validation

Could go in slots, like

    sch.add(TempReading, 'The temperature at station [station] was [temp, type number, min -274, max 300]C at time [timestamp].')

It depends how much value there is in interchange and standardization. Is that better than this?

    TempReading.def = 'The temperature at station [station] was [temp]C at time [timestamp].'
    TempReading.validate.temp = (x) => x > -274 && x < 300
    sch.add(TempReading)

## Implementation Status

Editor's implementation: [psdad.js](https://github.com/sandhawke/psdad.js)

## Some Issues / Ideas

strict parsing, initiated by header

This text contains a section which is likely to be misunderstood if not read and understood in sequential order. If any part of it is read without understanding all previous parts, the entire section may be significantly misunderstood. If you understand this, you may read the text which follows, between the start and end markers. It has been modified so that it is less likely to be accidentally read out of order; the modification is that the Unicode code point of each character has been incremented; to read it, you must decrement the code points. `<text begins here>`[text]`<text ends here>`.

----

Not sure about the whitespace rules. Do they apply inside slot-text? If whitespace inside slot text is transformed, then we'll need to quote any text that happens to include a newline. What if it's two paragraphs? How would that be transmitted??

----

Lists are a datatype. Might be serialized several ways. We have a list. For input and output, each of which gets tried. e.g. JSON, markdown, and my [1] thing, which works better if we've lost newlines.

123,456 in US vs EU

headers? key-to-understanding-this-document? ref to key in msg?

* reading is 123,456 [see notes]
* reading is 123.456 [see note-7]
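
The code-point shift described in the strict-parsing note above could be sketched as follows. The helper names (`shiftCodePoints`, `obscure`, `reveal`) are hypothetical and not part of psdad.js, and the sketch assumes every character shifts by exactly one code point. It does not handle the edge cases where shifting leaves the valid range (for example incrementing U+10FFFF), which would throw a `RangeError`.

```javascript
// Hypothetical helpers illustrating the code-point shift; not part of psdad.js.
function shiftCodePoints (text, delta) {
  // Array.from iterates by code point, so surrogate pairs stay intact.
  return Array.from(text, ch => String.fromCodePoint(ch.codePointAt(0) + delta)).join('')
}

// Writer side: increment every code point so the text is not read by accident.
const obscure = text => shiftCodePoints(text, 1)
// Reader side: decrement every code point to recover the original text.
const reveal = text => shiftCodePoints(text, -1)

const hidden = obscure('At station 7')
console.log(hidden)
console.log(reveal(hidden))
```

A reader that has not understood the header sees only the shifted text, while one that follows the instructions recovers the original by applying `reveal`.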