Revisions to IDV spec

2024-08-06 21:19:45 -04:00
commit 924d8ccf48
parent 763bfbc8cf
1 changed files with 103 additions and 18 deletions
--- a/idv.md
+++ b/idv.md
@@ -2,23 +2,36 @@

 ## Overview

-The Indented Document Values (IDV) format is a text-based, whitespace-sensitive serialization format.
+The Indented Document Values (IDV) format is a meta-syntax for machine-readable textual data.

 IDV is designed to prioritize human readability and writability by minimizing visual noise- there are no sigils, quotes, or brackets, only colons, indentation, and (when necessary) backslash escapes.

-As a tradeoff, IDV is not a self-describing data format- you have to know what type of data an IDV document represents at the time you parse it.
+As a tradeoff, IDV is not a self-describing data format- while it can be used for defining a serialization or configuration format, systems using it need to layer their own semantics on top of it.

 ### Example

-> TODO: need something both concise and nontrivial. LDAP user data is certainly an option
+```
+Person: Alice
+  Uid: 1000
+  Phone: 555-1234
+  Group: users
+  Group: sudo
+  Banner:
+    ============================
+    This is my ASCII art login message
+    ============================
+
+Person: Bob
+  Uid: 1001
+  Phone: 555-5656
+  Group: users
+```

 ## Syntax

 IDV is a line-oriented format. Before any other parsing is done, the input is split into lines, and any trailing whitespace on a line (including line separators) is ignored.

-> TODO: possible redraft: sequence of comments, entry headers, and documents, defined by line types (blank, comment, entry header, indented)
-
-The lines of an IDV document represent a single flat list of Comments and Entries.
+### Comments

 A **Comment** is any line whose first character is a `#` character. Comment lines are for human use and are ignored by the parser.

@@ -26,26 +39,98 @@ A **Comment** is any line whose first character is a `#` character. Comment line
 # This line is ignored
 ```

-An **Entry**'s first line is unindented and contains the name of a **Category**, up to the first `:` character, followed by a **Distinguisher**. All following lines with indentation, if any, are the entry's **Document**:
+### Blank Lines
+
+A **Blank Line** is any line that only contains whitespace. Because trailing whitespace is always trimmed, all Blank Lines are indistinguishable from each other.
+
+Blank Lines are ignored unless they are part of a Document (see below).
+
+### Entries
+
+An **Entry** is composed of one or more lines:
+
+#### Tags
+
+Each Entry begins with a **Tag**, terminated by a colon (`:`). A Tag can contain any characters except leading or trailing whitespace, newlines, and colons:

 ```
-Collection: distinguisher
-  Indented
-  document
-
-  with a blank line
+Tag:
 ```

-1. The Category and Distinguisher are both trimmed of surrounding whitespace before being interpreted, but internal whitespace is left intact.
-1. Backslash unescaping is performed on the Category and Distinguisher.
+#### Distinguishers
+
+Optionally, a **Distinguisher** can follow the Tag on the same line. A Distinguisher can contain any characters except leading or trailing whitespace and newlines:
+
+```
+Tag: distinguisher
+```
+
+#### Escapes
+
+Within Tags and Distinguishers, backslash escapes may be used to represent non-permitted or inconvenient characters:
+
+```
+Tag With \: And Spaces:
+
+Tag: \ distinguisher with leading whitespace and\nA newline
+```
+
+| Escape sequence | Replacement       |
+| --------------- | ----------------- |
+| \\_\<space>_    | A literal space   |
+| \\n             | A newline         |
+| \\:             | A colon (`:`)     |
+| \\\\            | A backslash (`\`) |
+
+> TODO: additional escapes? ie, hex or unicode?
+
+#### Documents
+
+After the first line of an Entry, any indented lines make up the **Document** portion of the Entry:
+
+```
+Tag: distinguisher
+  First Line
+    Second Line
+  Third Line
+```
+
+The first line of a Document defines the Document's indentation- subsequent lines can be indented deeper, but no line may be indented _less_ than the first line. This indentation is removed from the beginning of each line when determining the Document's value.
+
+Blank Lines cannot carry indentation information. To resolve this ambiguity, a Document never begins or ends with a Blank Line: Blank Lines before the first indented line or after the last one are ignored. Blank Lines that occur _between_ indented lines _are_ considered part of the Document.
+
+```
+Tag:
+
+  The above blank line is ignored.
+  The below blank line is part of the Document.
+
+  The below blank line is ignored.
+
+Tag:
+  Other stuff
+```
+
+Backslash escapes are _not_ processed within a Document. However, backslashes may be processed later, by higher-layered semantics.
+
+In many cases the Document will contain recursive IDV data, and the rules above are designed to play nicely with this case- but it is up to the concrete format to decide how to parse the Document. It could just as easily contain free text, XML, or a base64 blob.
+
+#### Disambiguations
+
+1. The Tag and Distinguisher are both trimmed of surrounding whitespace before being interpreted, but internal whitespace is left intact.
 1. The Distinguisher may contain literal colons; these are treated as regular characters and carry no special meaning.
-1. The first line of a Document defines the document's indentation- subsequent lines can be indented deeper, but no line may be indented _less_ than the first line.
-1. It is ambiguous whether blank lines are part of a document or just aesthetic spacing for Entries; to resolve this, blank lines before and after a Document are ignored, but internal blank lines are considered part of the Document.
-1. Backslash unescaping is **not** performed on the Document. However, backslashes may be processed later, when the document is interpreted.

 ## Data Model

-> TODO: tuples, can be interpreted according to patterns
+Applying minimal interpretation, IDV data can be represented as a list of Entries.
+
+An Entry can be represented as a 3-tuple of:
+
+1. a string (the Tag)
+2. a string (the optional Distinguisher)
+3. a list of strings (the lines of the Document)
+
+How Entries are interpreted by the application is not specified, but see below for some suggested patterns that should line up with things people usually want to do.

 ## Patterns