Revisions to IDV spec

Tangent Wantwight 2024-08-06 21:19:45 -04:00
parent 763bfbc8cf
commit 924d8ccf48

idv.md

@ -2,23 +2,36 @@
## Overview
The Indented Document Values (IDV) format is a text-based, whitespace-sensitive serialization format.
The Indented Document Values (IDV) format is a meta-syntax for machine-readable textual data.
IDV is designed to prioritize human readability and writability by minimizing visual noise- there are no sigils, quotes, or brackets, only colons, indentation, and (when necessary) backslash escapes.
As a tradeoff, IDV is not a self-describing data format- you have to know what type of data an IDV document represents at the time you parse it.
As a tradeoff, IDV is not a self-describing data format- while it can be used for defining a serialization or configuration format, systems using it need to layer their own semantics on top of it.
### Example
> TODO: need something both concise and nontrivial. LDAP user data is certainly an option
```
Person: Alice
  Uid: 1000
  Phone: 555-1234
  Group: users
  Group: sudo
  Banner:
    ============================
    This is my ASCII art login message
    ============================
Person: Bob
  Uid: 1001
  Phone: 555-5656
  Group: users
```
## Syntax
IDV is a line-oriented format. Before any other parsing is done, the input is split into lines, and any trailing whitespace on a line (including line separators) is ignored.
> TODO: possible redraft: sequence of comments, entry headers, and documents, defined by line types (blank, comment, entry header, indented)
The lines of an IDV document represent a single flat list of Comments and Entries.
### Comments
A **Comment** is any line whose first character is a `#` character. Comment lines are for human use and are ignored by the parser.
@ -26,26 +39,98 @@ A **Comment** is any line whose first character is a `#` character. Comment line
# This line is ignored
```
An **Entry**'s first line is unindented and contains the name of a **Category**, up to the first `:` character, followed by a **Distinguisher**. All following lines with indentation, if any, are the entry's **Document**:
### Blank Lines
A **Blank Line** is any line that only contains whitespace. Because trailing whitespace is always trimmed, all Blank Lines are indistinguishable from each other.
Blank Lines are ignored unless they are part of a Document (see below).
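
As a rough illustration of the line types described so far, the sketch below trims trailing whitespace and then sorts each line into one of four kinds. The function names and the kind labels (`blank`, `comment`, `indented`, `header`) are hypothetical and not part of the spec. It also assumes Python's `str.splitlines()` is an acceptable definition of a line separator, which the spec does not state.

```python
def split_lines(text: str) -> list:
    """Split input into lines and drop trailing whitespace from each one."""
    # Assumption: str.splitlines() decides what counts as a line separator.
    return [line.rstrip() for line in text.splitlines()]


def classify(line: str) -> str:
    """Return a hypothetical label for an already-trimmed line."""
    if line == "":
        return "blank"      # only whitespace before trimming
    if line[0] == "#":
        return "comment"    # first character is '#'
    if line[0].isspace():
        return "indented"   # candidate Document line
    return "header"         # unindented: should start an Entry


lines = split_lines("# note\nTag: x  \n  body\n   \nTag:")
kinds = [classify(line) for line in lines]
```

An indented line that happens to start with `#` after its indentation is classified as `indented`, since only the first character of the line is checked.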
### Entries
An **Entry** is composed of one or more lines:
#### Tags
Each entry begins with a **Tag**, terminated by a colon (`:`). A Tag can contain any characters except leading or trailing whitespace, newlines, and colons:
```
Collection: distinguisher
Indented
document
with a blank line
Tag:
```
1. The Category and Distinguisher are both trimmed of surrounding whitespace before being interpreted, but internal whitespace is left intact.
1. Backslash unescaping is performed on the Category and Distinguisher.
#### Distinguishers
Optionally, a Distinguisher can follow the Tag on the same line. A Distinguisher can contain any characters except newlines and leading or trailing whitespace:
```
Tag: distinguisher
```
#### Escapes
Within Tags and Distinguishers, backslash escapes may be used to represent non-permitted or inconvenient characters:
```
Tag With \: And Spaces:
Tag: \ distinguisher with leading whitespace and\nA newline
```
| Escape sequence | Replacement |
| --------------- | ----------------- |
| \\_\<space>_ | A literal space |
| \\n | A newline |
| \\: | A colon (`:`) |
| \\\\ | A backslash (`\`) |
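
The escape table above could be applied with a small scanner like the following sketch. The name `unescape` is hypothetical. The spec does not say what happens on an unknown escape or a lone trailing backslash; raising `ValueError` in both cases is an assumption made here.

```python
# Character following the backslash -> replacement, per the table above.
ESCAPES = {" ": " ", "n": "\n", ":": ":", "\\": "\\"}


def unescape(text: str) -> str:
    """Replace backslash escapes in a Tag or Distinguisher."""
    out = []
    i = 0
    while i < len(text):
        ch = text[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        # Assumption: unknown or incomplete escapes are errors.
        if i + 1 >= len(text) or text[i + 1] not in ESCAPES:
            raise ValueError(f"invalid escape at position {i}")
        out.append(ESCAPES[text[i + 1]])
        i += 2
    return "".join(out)
```

A concrete format built on IDV might prefer to pass unknown escapes through unchanged instead.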
> TODO: additional escapes? ie, hex or unicode?
#### Documents
After the first line of an entry, any indented lines make up the **Document** portion of the entry:
```
Tag: distinguisher
  First Line
  Second Line
  Third Line
```
The first line of a Document defines the Document's indentation- subsequent lines can be indented deeper, but no line may be indented _less_ than the first line. This indentation is removed from the beginning of each line when determining the Document's value.
Blank Lines cannot carry indentation information. To resolve this ambiguity, Documents may not begin or end with Blank Lines- such lines are ignored. Blank Lines that occur _between_ indented lines _are_ considered part of the Document.
```
Tag:

  The above blank line is ignored.
  The below blank line is part of the Document.

  The below blank line is ignored.

Tag:
  Other stuff
```
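
The indentation and blank-line rules for Documents can be sketched as below. The function name `extract_document` is hypothetical. It takes the already-trimmed lines that follow an Entry's first line. It also assumes indentation is compared as an exact string prefix, since the spec does not say how tabs and spaces mix.

```python
def extract_document(lines: list) -> list:
    """Return the Document's lines with the common indentation removed."""
    # Leading and trailing Blank Lines are not part of the Document.
    start, end = 0, len(lines)
    while start < end and lines[start] == "":
        start += 1
    while end > start and lines[end - 1] == "":
        end -= 1
    body = lines[start:end]
    if not body:
        return []

    # The first line defines the Document's indentation.
    first = body[0]
    indent = first[: len(first) - len(first.lstrip())]

    result = []
    for line in body:
        if line == "":
            result.append("")                  # interior Blank Line is kept
        elif line.startswith(indent):
            result.append(line[len(indent):])  # deeper indentation survives
        else:
            raise ValueError("line indented less than the first line")
    return result
```

Lines indented deeper than the first keep their extra indentation, which is what lets a Document hold nested IDV data.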
Backslash escapes are _not_ processed within a Document. However, backslashes may be processed later, by higher-layered semantics.
In many cases the Document will contain recursive IDV data, and the rules above are designed to play nicely with this case- but it is up to the concrete format to decide how to parse the Document. It could just as easily contain free text, XML, or a base64 blob.
#### Disambiguations:
1. The Tag and Distinguisher are both trimmed of surrounding whitespace before being interpreted, but internal whitespace is left intact.
1. The Distinguisher may contain literal colons; these are treated as regular characters and carry no special meaning.
1. The first line of a Document defines the document's indentation- subsequent lines can be indented deeper, but no line may be indented _less_ than the first line.
1. It is ambiguous whether blank lines are part of a document or just aesthetic spacing for Entries; to resolve this, blank lines before and after a Document are ignored, but internal blank lines are considered part of the Document.
1. Backslash unescaping is **not** performed on the Document. However, backslashes may be processed later, when the document is interpreted.
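
One way to split an Entry's first line consistently with these rules is to cut at the first colon that is not preceded by a backslash, then trim both halves. This is a sketch only. `split_header` is a hypothetical name, the results are still escaped, and a missing colon is treated as an error, which the spec does not state. An absent Distinguisher is returned as an empty string, also an assumption. An escaped space at the very end of a Tag is not handled.

```python
def split_header(line: str) -> tuple:
    """Split 'Tag: distinguisher' into (raw_tag, raw_distinguisher)."""
    i = 0
    while i < len(line):
        if line[i] == "\\":
            i += 2          # skip the escaped character, e.g. '\:'
            continue
        if line[i] == ":":
            # Later colons belong to the Distinguisher and are left alone.
            return line[:i].strip(), line[i + 1:].strip()
        i += 1
    raise ValueError("entry header has no unescaped colon")
```

Trimming happens before unescaping, so a Distinguisher written as `\ text` keeps its escaped leading space.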
## Data Model
> TODO: tuples, can be interpreted according to patterns
Applying minimal interpretation, IDV data can be represented as a list of Entries.
An Entry can be represented as a 3-tuple of:
1. a string (the Tag)
2. a string (the optional Distinguisher)
3. a list of strings (the lines of the Document)
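
As an illustration of this 3-tuple, a minimal sketch in Python might look like the following. The class name `Entry`, its field names, and the choice of an empty string for a missing Distinguisher are assumptions and not part of the spec.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    tag: str               # the Tag, after unescaping
    distinguisher: str     # assumed "" when no Distinguisher is given
    document: List[str]    # Document lines with indentation removed


# Hand-built values mirroring the shape of the example near the top.
entries = [
    Entry("Person", "Alice", ["Uid: 1000", "Phone: 555-1234"]),
    Entry("Person", "Bob", ["Uid: 1001"]),
]

tag, distinguisher, document = entries[0]  # unpacks like a plain tuple
```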
How Entries are interpreted by the application is not specified, but see below for some suggested patterns that should line up with things people usually want to do.
## Patterns