Skip to content

Latest commit

 

History

History
791 lines (691 loc) · 29.5 KB

notes.org

File metadata and controls

791 lines (691 loc) · 29.5 KB

BULK

Trade-off

uniformity
as good as Lisp
generality
as good as Lisp
extensibility
decentralization
only decentralized binary format? on par with plain text formats using decentralized IDs (like URI based on decentralized DNS)
compactness
better than all general binary formats because they lack abstraction, worse than any ad hoc format (like ASN1)
processing speed
extremely simple to parse
  • easy to evaluate
  • possible to forbid evaluation for extremely restricted environments (embedded)

Syntax

markershapenotes
00nil
01(
02)
03# Int {content}
04–0Freserved
10–17BULK
18–7Frefs
80–BFw6[value]value = (marker && 3F)
C0–FF#[size] {content}size = (marker && 3F)

BULK core namespace

NSID: 0x20

0x00
( version Int Int )
0x01
true
0x02
false
0x03
( stringenc {enc}:Encoding )
0x04
( iana-charset Nat ):Encoding
0x05
( codepage Nat ):Encoding
0x06
( ns {marker}:Ref {id}:UniqueID )
0x07
( package {id}:UniqueID {namespaces} )
0x08
( import {base}:Int {count}:Int {uuid} )
0x09
( define Ref Expr )
0x0A
( mnemonic/def {name}:Expr {mnemonic}:String {doc}:Expr {value} )
0x0B
( ns-mnemonic {ns}:Expr {mnemonic}:String {doc} )
0x0C
( verifiable-ns {marker}:Int {id}:UniqueID {mnemonic}:Expr {doc}:Expr {definitions} )
  • ( mnemonic/def nil {…} ) defines the next unattributed ref (starting at 0)
0x10
( concat {a1} {a2} )
0x11
( subst {code} )
0x12
( arg Int )
0x13
( rest Int )
0x20
( unsigned-int Bytes )
0x21
( signed-int Bytes )
0x22
( fraction Int Int )
0x23
( binary-float Bytes )
0x24
( decimal-float Bytes )
0x25
( binary-fixed Nat Bytes )
0x26
( decimal-fixed Nat Bytes )
0x27
decimal2
0x30
( prefix-bytecode {bytecode} )
0x31
( prefix-bytecode* ( {arities} ) {bytecode} )
  • {arities} is a sequence of: ( Int {refs} )
0x32
( postfix-bytecode {bytecode} )
0x33
( postfix-bytecode* ( {arities} ) {bytecode} )
0x34
( arity Int {refs} )

BULK API Model

  • needed at least for hashing: access the bytes of an expression (API to get positions to read the bytes independently or API to get the bytes directly)

BARK: BULK Archive

When needed, metadata can be any expression (nil SHOULD be used to indicate no metadata).

When reading expressions as entries, ignore nil and process description forms. Entries and descriptions both are numbered (not nil).

0x00
( bark {content} )
  • can also be used as a reference after imports, to signal that the stream is a BARK stream
0x01
( entry {gbc-tag}:Expr {metadata}:Expr {content}:Expr )
  • {content} can be an array (e.g. a file’s content) or BULK expression
0x03
( description {metadata}:Expr )
  • can be inserted in many places in a BULK stream to annotate virtually anything
0x04
( metadata {data} )
0x05
( count {num} )
0x06
( about {what} )
  • {what} is a sequence of expressions, each identifying the entry
0x25
( entry {num} )
0x26
( previous {skip} )
  • within a metadata form, designates the expression before that metadata form (possibly after skipping {skip} expressions)
0x27
( next {skip} )
  • within a metadata form, designates the expression after that metadata form (possibly after skipping {skip} expressions)
0x28
everything-before
  • within a metadata form, designates the whole sequence of expressions before that metadata form
0x29
( before {marker}:Ref {skip} )
  • within a metadata form, designates the expression in the outer context of the metadata form that is before the occurrence of {marker} (possibly after skipping {skip} expressions)
    • undefined if multiple occurrences
0x2A
( after {marker}:Ref {skip} )
  • like before, but after…
0x10
gbc|
  • opaque GBC form
0x11
gbc>
  • GBC form must not be preserved if payload is modified
0x12
gbc*>
  • preservable GBC form
0x13
gbc*~>
  • preservable GBC form whose payload was modified
0x03
( bulk-stream {stream} )
0x04
( bulk-bytes {bulk}:Bytes )

-

0x30
( compressed gbc| {method}:MeTOD Bytes )
0x31
deflate
0x32
deflate64
0x33
lzma
0x34
lzma2
Ox35
bz2
0x36
lzw
0x37
lzo
0x3D
( hash {signature}:Expr )
0x3E
( hashed gbc> {signature}:Expr Expr )
0x3F
( encrypted gbc| {method} Bytes )
  • tar semantics
    • metadata
      Ox40
      ( path {components} )
  • by design, there is no way to express an absolute FS path
    • an application is free to define insecure forms to express absolute paths and links
    • TODO: what if a component contain “/”?
      • implementation should not resolve the name but look it up in the directory entries (that takes care of “/” but not of a “..” entry, this still needs checking, shame on Unix)
        0x41
        ( user {name} )
  • {name} can be anything, incl. string and Nat
    • multiple entries (e.g. “pierre”/1000)
      0x42
      ( group {name} )
      0x43
      contiguous
      0x44
      ( access {time} )
      0x45
      ( modification {time} )
      0x46
      ( change {time} )
      0x47
      ( mode {mode} )
      0x48
      ( posix-acl {acls} )
0x49
( user {id} {mode} {default?} )
0x4A
( group {id} {mode} {defaults?} )
0x4B
( other {mode} {defaults?} )
0x4C
( mask {mode} {defaults?} )
0x4D
( xattr {xattr} )
  • {xattrs} = ( {name} {value} )+
    Ox4E
    ( offsets Nat+ )
  • meant for forms not containing individual entries’ metadata
  • TODO: base?
    0x4F
    ( offset Nat )
  • meant for forms grouping an entry with its metadata
  • TODO: base?
    • entry
      • an array as an entry (possibly within GBC forms) is presumed to be a regular file
      0x50
      ( hard-link Path )
      0x51
      ( sym-link Path )
      0x52
      ( char-dev {major}:Int {minor}:Int )
      0x53
      ( block-dev {major}:Int {minor}:Int )
      0x54
      directory
      0x55
      fifo
      0x56
      ( sparse-file {segments} )
  • Bytes
0x57
( hole {size}:Nat )
  • {size} in bytes
  • gzip semantics
    0x60
    ( binary Boolean )
    0x61
    ( comment Expr )
    0x62
    ( os Expr )
    • vocabularies may provide additional expressions for OSes
    0x70
    FAT file system (DOS, OS/2, NT) + PKZIPW 2.50 VFAT, NTFS
    0x71
    Amiga
    0x72
    VMS (VAX or Alpha AXP)
    0x73
    Unix
    0x74
    VM/CMS
    0x75
    Atari
    0x76
    HPFS file system (OS/2, NT 3.x)
    0x77
    Macintosh
    0x78
    Z-System
    0x79
    CP/M
    0x7A
    TOPS-20
    0x7B
    NTFS file system (NT)
    0x7C
    SMS/QDOS
    0x7D
    Acorn RISC OS
    0x7E
    VFAT file system (Win95, NT)
    0x7F
    MVS (code also taken for PRIMOS)
    0x80
    BeOS (BeBox or PowerMac)
    0x81
    Tandem/NSK
    0x82
    THEOS
    0x63
    maximum-compression
    0x64
    fastest-comœpression
    0x83
    ( acorn-bbc-mos-file-type-info Bytes )
    0x84
    ( apollo-file-type-info Bytes )
    0x85
    ( cpio-compressed Bytes )
    • gzsig extra field should be created from a compatible cryptographic signature
    0x86
    ( keynote-assertion Bytes )
    0x88
    ( macintosh-info Bytes )
    0x89
    ( acorn-file-type-info Bytes )
  • dar semantics
    • split archives
      • advertised in container metadata
0x90
( split-archive {archive-id} {member-id} {members} )
  • members are description forms that MAY contain filename or hash
    • FS-specific attributes
    • incremental backup?
    • fast member extract? (how does DAR does that?)

One could define a whole namespace of compact versions, like

about-num ⇔ ( lambda n ( about ( entry n ) ) )
about-previous ⇔ ( about ( previous ) )
about-previous* ⇔ ( lambda n ( about ( previous n ) ) )
about-num[3] ⇔ ( about ( entry 3 ) )
about-previous[2] ⇔ ( about ( previous 2 ) )

BUlk possibly-Zipped archive (.buz)

A foobar.buz archive with multiple files:

( pack ( metadata ( count 2 ) )
  ( described gbc*> ( metadata ( path "foo.txt" ) )
    ( compressed gbc| lzma {foo.txt}:Bytes ) )
  ( described ( metadata ( path "bar.jpg" ) )
    {bar.jpg}:Bytes ) )

A foo.txt.buz archive with a single file and hash for integrity:

( described ( metadata ( path "foo.txt" ) )
  ( hashed gbc> ( sha3 {hash}:Bytes )
    ( compressed gbc| lzma {foo.txt} ) ) )

A manifest.buz manifest about other files:

( description ( metadata ( ( about ( path "foo.txt" ) )
                           ( hash ( md5 {hash}:Bytes ) ) ) ) )
( description ( metadata ( ( about ( path "bar.txt" ) )
                           ( hash ( md5 {hash}:Bytes ) ) ) ) )
( description ( metadata ( ( about ( path "baz.iso" ) )
                           ( hash ( sha3 {hash}:Bytes ) ) ) ) )

BARK utility

$ bark list <file>
$ bark check <file>
$ bark extract <file> [<members>]

convert

$ barf convert --to gzip <file>
$ barf convert --from dar <file>

This command convert from and to BULK. Converting to and then from BULK should produce a file at least semantically identical, (it may be bytewise identical, and it might be an implementation goal to achieve that, but no metadata is stored to that end by default).

  • mode of operation
    lossless
    refuse conversion if semantic information would be lost (i.e. if a string is not encodable in the target format, but not if random padding is present)
    lossy
    not lossless (i.e. a one-member tar archive converted to BULK might then be converted to gzip, at the price of losing ACLs)
    transform
    change data representation to fit target format (i.e. if target is gzip, LZMA data would be recompressed to deflate, a UTF-8 string encoded in ISO-8859-1)
    maintain
    refuse conversion if data representation in the source format doesn’t fit target format
  • should never need to refuse if BULK is target?
    • default is lossless transform

    Targets:

    • manifests
      • SFV
    • compression formats

BARK Object Model

  • access to metadata
    • consolidated metadata when forms overwrite each other?
  • API for history?
    • access to entries
      • across manifests/packs/stacks within a common context
    • ability to add entries/metadata while not breaking hashes
      • when hash is recomputable:
  • app knows algo/has all data to hash (key, etc…)
  • modify/delete/append in place
  • rehash
    • when hash is not recomputable:
  • app doesn’t know algo/lacks some data
  • modify/delete raise error
  • append after original data

Comparison

  • tar
    • +compression
  • zip
  • XZ
    • has a limited choice of compression/hash
  • gzip
  • cpio, pax

BULK stream with size

BULK cannot contain a form like ( bulk/size {size}:Nat {bulk} ) because size could be erroneous and then parsing the whole stream or skipping {bulk} would give two different results (or more if {bulk} contains other such erroneous forms.

This could represent a security risk, with some parsers not seeing an issue and others triggering it.

Lambda expressions

( verifiable-ns 24 {id} nil "λ"
"This vocabulary can be used to represent functions that can be evaluated."

( mnemonic/def nil "lambda" "( lambda {var}:Ref {body} )" )

( define 0x18FF "This reference is intended to be used as lambda function variable." )
( mnemonic/def nil "a" 0x18FF )
( mnemonic/def nil "b" 0x18FF )
( mnemonic/def nil "c" 0x18FF )
( mnemonic/def nil "d" 0x18FF )
( mnemonic/def nil "e" 0x18FF )
( mnemonic/def nil "f" 0x18FF )
( mnemonic/def nil "g" 0x18FF )
( mnemonic/def nil "h" 0x18FF )
( mnemonic/def nil "i" 0x18FF )
( mnemonic/def nil "j" 0x18FF )
( mnemonic/def nil "k" 0x18FF )
( mnemonic/def nil "l" 0x18FF )
( mnemonic/def nil "m" 0x18FF )
( mnemonic/def nil "n" 0x18FF )
( mnemonic/def nil "o" 0x18FF )
( mnemonic/def nil "p" 0x18FF )
( mnemonic/def nil "q" 0x18FF )
( mnemonic/def nil "r" 0x18FF )
( mnemonic/def nil "s" 0x18FF )
( mnemonic/def nil "t" 0x18FF )
( mnemonic/def nil "u" 0x18FF )
( mnemonic/def nil "v" 0x18FF )
( mnemonic/def nil "w" 0x18FF )
( mnemonic/def nil "x" 0x18FF )
( mnemonic/def nil "y" 0x18FF )
( mnemonic/def nil "z" 0x18FF )

( mnemonic/def nil "id" "Somestimes a form is needed just to add a semantic aspect to an expression without actually changing its value for most purposes. For these cases, a reference can be given the value of id. Some processing applications will substitute their own evaluation to this one to implement that semantic." ( lambda x x ) )
)

Useful vocabularies

RDF 1.0

TODO: update to RDF 1.1 or 1.2

0x01
uriref ⇔ λ:id
0x02
( base Bytes )
0x03
prefix ⇔ ( uri( lambda u ( lambda s ( concat u s ) ) )
0x04
rdf# ⇔ ( uriref ”http://www.w3.org/1999/02/22-rdf-syntax-ns#” )
0x05
blank
0x06
( plain {lang} {literal} )
0x07
( datatype {id}:URIRef {literal} )
0x08
xmlliteral ⇔ ( rdf# “XMLLiteral” )
0x09
( triples {triples} )
0x0A
( turtle {statements} )
0x0B
type ⇔ ( rdf# “type” )
0x0C
property ⇔ ( rdf# “Property” )
0x0D
statement ⇔ ( rdf# “Statement” )
0x0E
subject ⇔ ( rdf# “subject” )
0x0F
predicate ⇔ ( rdf# “predicate” )
0x10
object ⇔ ( rdf# “object” )
0x11
bag ⇔ ( rdf# “Bag” )
0x12
seq ⇔ ( rdf# “Seq” )
0x13
alt ⇔ ( rdf# “Alt” )
0x14
value ⇔ ( rdf# “value” )
0x15
list ⇔ ( rdf# “List” )
0x16
nil ⇔ ( rdf# “nil” )
0x17
first ⇔ ( rdf# “first” )
0x18
rest ⇔ ( rdf# “rest” )
0x19
plainliteral ⇔ ( rdf# “PlainLiteral” )

-

0x20
this-resource
0x21
uri

Differences between complete triples (3s) and turtle-like (Tl)

In 3s, a single triple cannot cost less than 8 bytes:

(:A:B:C)

For big graphs of mostly known references, this can already be a valuable improvement. {triples} could be a packed sequence without markers around triples, but that would mean that a single missing or superfluous expression would wreck everything that’s after it. The fact that a triple is still a form limits the savings but keeps a level of robustness (but it would be possible to define a packing RDF form…).

Adding another triple cannot cost less than adding 8 bytes:

(:A:B:C)(:A:B:D)

In Tl, a standalone triple cannot cost less than 10 bytes:

(:A(:B:C))

But adding another triple can cost as few as 2 bytes:

(:A(:B:C:D))

XML

XML is pretty complex, but most of it is unused (some even advised not to be used, i.e. unparsed entity). The vocabulary can be split into loosely coupled parts:

  • document
  • DTD
  • schema
  • Relax NG

Document

XML content, not notation: no support for entities or CDATA. stringenc can be used everywhere.

( define ?rfc ( subst ( pi "rfc" ( rest 0 ) ) ) )
( verifiable-ns 0x1800 {id} nil "xml"
"This vocabulary can be used to represent XML data."

( mnemonic/def nil "xml1.0" "( xml1.0 {content} )"
( mnemonic/def nil "xml1.1" "( xml1.1 {content} )"
( mnemonic/def nil "pi" "( pi {target} {content} )"
( mnemonic/def nil "comment" "( comment {content} )"
( mnemonic/def nil "element" "( element {name} {content} )"
( mnemonic/def nil "attribute" "( attribute {name} {value} )"
( mnemonic/def nil "xml:" ( rdf:prefix "http://www.w3.org/XML/1998/namespace" )
( mnemonic/def nil "xmlns:" "xmlns:" ( rdf:prefix "http://www.w3.org/2000/xmlns/" )
( mnemonic/def nil "preserve" "preserve" ( attribute ( xml: "space" ) "preserve" ) )

Default package?

RDF + Simple XML ( + XPath )

XPath namespace

( verifiable-ns 0x1800 {id} nil "xpath"
"This vocabulary can be used to represent XPath expressions."

( mnemonic/def nil "xpath" "( xpath {steps} )" )
( mnemonic/def nil "union" "( union {exprs} )" )
( mnemonic/def nil "step" "( step {axis} {test} {preds} )" )
( mnemonic/def nil "ancestor" nil )
( mnemonic/def nil "ancestor-or-self" nil )
( mnemonic/def nil "attribute" nil )
( mnemonic/def nil "child" nil )
( mnemonic/def nil "descendant" nil )
( mnemonic/def nil "descendant-or-self" nil )
( mnemonic/def nil "following" nil )
( mnemonic/def nil "following-sibling" nil )
( mnemonic/def nil "namespace" nil )
( mnemonic/def nil "parent" nil )
( mnemonic/def nil "preceding" nil )
( mnemonic/def nil "preceding-sibling" nil )
( mnemonic/def nil "self" nil )
( mnemonic/def nil "node()" nil )
( mnemonic/def nil "text()" nil )
( mnemonic/def nil "comment()" nil )
( mnemonic/def nil "pi()" "pi() or ( pi() {name}:String )" )
( mnemonic/def nil "pi()" nil )

( mnemonic/def nil "." "" ( step self node() ) )
( mnemonic/def nil ".." "" ( step parent node() ) )
( mnemonic/def nil "//" "" ( step descendant-or-self node() ) )

( mnemonic/def nil "step*" "" ( λ:lambda λ:a ( step λ:a node() ) ) )


)

As a Step, {name}:QName ⇔ ( step child {name} ) ?

QName

To maximize reuse between namespaces, URIRef and URIString expressions also have the type QName. Any Bytes whose content satisfy the NCName production also has.

MeTOD: Media Type Optimal Description

  • type as ref or form
  • atomic type
    • html5
    • jpeg
  • composite type
    • syntax: ( main-type {params} )
    • example: xml
      • ( xml xhtml rdf )
  • meaning xml with xhtml document root and rdf elements
    • xhtml* = ( subst ( xml xhtml ( rest 0 ) ) )
  • ( xhtml* mathml svg )
    • some MeTOD types may only make sense as sub-types
      • e.g. xml NS that doesn’t have a document element
        • like dublin-core: ( xml xhtml svg dublin-core )
    • encoding as first-class type
      • ( gzip tar )
      • ( base64 zip )
    • complex structures?
      • ( mime ( alternatives ( qp ( text utf-8 ) ) ( qp html5 ) ) ( base64 zip ) ( signature ( base64 openpgp ) ) )
    • accept patterns
      • ( xml * )
      • ( xml xhtml * )
    • semantics dictated by type
      • for xml, the first subtype MUST be the type for the document element
      • for MIME, order of subtypes is order of parts
0x00
( type {type}:Expr )
  • metadata form
0x02
*
0x03
bulk / ( bulk {namespaces} )
0x10
( text {encoding} {subtype} )
  • means that it can be shown as-is to a user
  • ( text {encoding} ) means plain text
  • ( text utf-8 markdown )
  • ( text utf-8 ( source-code c++ ) )
0x11
( source {langs} )
  • only as subtype of text, means that if available, a code editor should be used to view this format
0x12
( image {subtype} )
0x13
( audio {subtype} )
0x14
( video {subtype} )

MeTOD only defines kinds where a default software could be expected to process many or most types of this kind. This is not the case for MIME registries application, text, message, model, multipart and text. But a MIME vocabulary could define them.

Dates namespace

  • Nat123 := Nat | Nat Nat | Nat Nat Nat
  • NatsF := Nat* ( Float | Nat )
  • Time = Date | TimeOfDay
0x00
( calendar Nat123 )
0x01
( weekdate Nat123 )
0x02
( ordinal Nat Nat )
0x03
( time NatsF )
0x04
( point Date TimeOfDay )
0x05
( zulu Time )
0x06
( offset TimeOfDay Time )
0x07
( years NatsF )
0x08
( months NatsF )
0x09
( days NatsF )
0x0A
( hours NatsF )
0x0B
( minutes NatsF )
0x0C
( seconds NatsF )
0x0D
( weeks Nat )
0x0E
( interval {exprs} )
  • {exprs} = Time Time | Duration Time | Time Duration | Duration
0x0F
( repeat Nat Interval ) / ( repeat Interval )
???
( julian Number )
???
( anno-mundi Nat123 )
???
( anno-hegirae Nat123 )
???
( unix SInt )
???
( ntp Word )
???
( tai64 Word64 )
???
( tng-stardate Nat Nat )

Hash

   ( verifiable-ns 0x1800 {id} nil "hash"
   "The forms in this vocabulary can be used to represent hashes along with the hashing algorithm instead of using an unmarked byte sequence. When an algorithm has other inputs than the message, they can be provided after the hash itself as a property list.

When an algorithm can produce hashes in different sizes and the size used is a number of bits divisible by 8, the size property should be omitted from the property list and inferred by the processing application from the size of the BULK expression (e.g. `( sha3 #[24] {hash}:24B )` is a 196-bits SHA3 hash).

As a rule, each of these forms can contain `nil` as a first expression to denote not a hash but a choice of configuration in some application context. For example, `( uuid nil prepend {ns}:Bytes )` could mean that subsequent v3 and v5 UUIDs will be produced with {ns} as UUID namespace."

   ( mnemonic/def nil "bsd" "( bsd Bytes )" )
   ( mnemonic/def nil "sysv" "( sysv Bytes )" )
   ( mnemonic/def nil "crc" "( crc Bytes )" )
   ( mnemonic/def nil "fletcher" "( fletcher Bytes {config} )" )
   ( mnemonic/def nil "adler32" "( adler32 Bytes )" ( λ:lambda λ:h ( fletcher λ:h key 65521 ) ) )
   ( mnemonic/def nil "pjwhash" "( pjwhash Bytes )" )
   ( mnemonic/def nil "elfhash" "( elfhash Bytes )" )

   ( mnemonic/def nil "murmur1" "( murmur1 Bytes )" )
   ( mnemonic/def nil "murmur2" "( murmur2 Bytes )" )
   ( mnemonic/def nil "murmur2a" "( murmur2a Bytes )" )
   ( mnemonic/def nil "murmur64a" "( murmur64a Bytes )" )
   ( mnemonic/def nil "murmur64b" "( murmur64b Bytes )" )
   ( mnemonic/def nil "murmur3" "( murmur3 Bytes )" )

   ( mnemonic/def nil "umac" "( umac Bytes {config} )" )
   ( mnemonic/def nil "vmac" "( vmac Bytes {config} )" )

   ( mnemonic/def nil "uuid" "( uuid Bytes {config} )" )
   ( mnemonic/def nil "md2" "( md2 Bytes )" )
   ( mnemonic/def nil "md4" "( md4 Bytes )" )
   ( mnemonic/def nil "md5" "( md5 Bytes )" )
   ( mnemonic/def nil "md6" "( md6 Bytes {config} )" )
   ( mnemonic/def nil "ripemd" "( ripemd Bytes )" )
   ( mnemonic/def nil "haval" "( haval Bytes )" )
   ( mnemonic/def nil "gost" "( gost Bytes )" )
   ( mnemonic/def nil "sha1" "( sha1 Bytes )" )
   ( mnemonic/def nil "sha2" "( sha2 Bytes )" )
   ( mnemonic/def nil "sha3" "( sha3 Bytes )" )
   ( mnemonic/def nil "tiger" "( tiger Bytes )" )
   ( mnemonic/def nil "tiger2" "( tiger2 Bytes )" )
   ( mnemonic/def nil "whirlpool" "( whirlpool Bytes )" )
   ( mnemonic/def nil "blake" "( blake Bytes )" )
   ( mnemonic/def nil "blake2" "( blake2 Bytes )" )

   ( mnemonic/def nil "size" )
   ( mnemonic/def nil "prepend" )
   ( mnemonic/def nil "append" )
   ( mnemonic/def nil "key" )
   ( mnemonic/def nil "salt" )
   ( mnemonic/def nil "rounds" )

   )

Encryption

  • blowfish?
  • camellia?
  • twofish?
  • AES?
  • serpent?
  • openpgp?

Asking input

Implementation notes

Semantics beyond definitions

When implementing a processing application that gives semantics beyond the evaluation of expressions, to benefit from all possible evaluations, the application should just replace relevant prior definitions with its own implementation while evaluating the BULK streams.

Bootstrapping a hashing vocabulary

  • the problem is that this vocabulary provides hashes before any way of expressing a hash is possible, so its own hash is expressed with a name inside the vocabulary
  • you read ( ns 0x1800 ( 0x182E #[8] {hashID}:8B ) )
  • how do you get to the point where you know 0x182E is hash:sha3?
    • you get the list (hopefully with only one element) of vocabularies identified by a form whose sole element is a 64-bits word {hashid}
    • for each of them, you check if 0x2E is a name associated with a hashing algorithm
      • if yes, you check if that hash matches the definition

Minimal BULK

( version 1 0 ) ( ns 24 ( 24:sha3 #[8] 8B ) ) ( ns 25 ( 24:sha3 #[8] 8B ) ) 25:type #[0–63] {content}
|<---- 6 ---->| |<----------- 18 ---------->| |<----------- 18 ---------->| |<---- 3 ---->|
( version 1 0 ) ( ns 24 ( 24:sha3 #[8] 8B ) ) ( ns 25 ( 24:sha3 #[8] 8B ) ) 25:type # {size} {content}
|<---- 6 ---->| |<----------- 18 ---------->| |<----------- 18 ---------->| |<- 3 ->|  2/3/5/9

When a profile is known (like a specific file extension for typed blobs):

( version 1 0 ) 24:type #…
|<---- 8 ---->| |<- 3 ->|
contentBULK overheadwith profile
63B45B11B
255B47B13B
65kB48B14B
4GB50B16B
18EB54B20B

The power of abstraction

Explicit relationship between similar data

When a protocol makes it possible to express several different data that are related but different in structure, most other formats can only express those relationships in human readable documentation.

Example: an event format that includes information about the person doing the action and the person logging it:

{ "eventType": "WallPainted",
  "eventDate": "20210723T235601Z",
  "wallId": "d9e62839-dafe-49e1-b4d2-ce99c035fa9f",
  "logged": { "by": "alice", "date": "20210723T090145Z" },
  "painted": { "by": "bob", "date": "20210722T193545Z" }
}

When the date for the action or logging is the same as the event, it might be omitted:

{ "eventType": "WallPainted",
  "eventDate": "20210723T235601Z",
  "wallId": "d9e62839-dafe-49e1-b4d2-ce99c035fa9f",
  "logged": { "by": "alice" },
  "painted": { "by": "bob" }
}

And the fomat might include a shortcut for those cases:

{ "eventType": "WallPainted",
  "eventDate": "20210723T235601Z",
  "wallId": "d9e62839-dafe-49e1-b4d2-ce99c035fa9f",
  "loggedBy": "alice",
  "paintedBy": "bob"
}

With almost all existing formats, there is not way to convey the relationship between loggedBy and logged.

But in BULK, loggedBy can be explicitly defined in terms of logged:

( verifiable-ns 0x1800 {id} nil "wall"
"This vocabulary can be used to represent wall painting events."

( mnemonic/def nil "event" "( event {properties} )"
( mnemonic/def nil "type" "( type Expr )"
( mnemonic/def nil "date" "( date Expr )"
( mnemonic/def nil "wallId" "( comment {content} )"
( mnemonic/def nil "logged" "( logged {properties} )"
( mnemonic/def nil "painted" "( painted {properties} )"
( mnemonic/def nil "by" "( by {person} )"
( mnemonic/def nil "wallPaintedEvent" "( wallPaintedEvent {properties} )" ( subst ( event ( type "WallPainted" ) ( rest 0 ) ) )
( mnemonic/def nil "loggedBy" "( loggedBy {person} )" ( subst ( logged ( by ( arg 0 ) ) ) )
( mnemonic/def nil "paintedBy" "( paintedBy {person} )" ( subst ( painted ( by ( arg 0 ) ) ) )

Which means that the application processing this format doesn’t even need to know about loggedBy and paintedBy because BULK evaluation will transform them away:

( wallPaintedEvent
  ( date "20210723T235601Z" )
  ( loggedBy "alice" )
  ( paintedBy "bob" ) )

will get evaluated into:

( event
  ( type "WallPainted" )
  ( date "20210723T235601Z" )
  ( logged
    ( by "alice" ) )
  ( painted
    ( by "bob" ) ) )

Forward compatibility

BULK makes it possible to create new versions of vocabularies that encompass previous versions, in a way that minimizes implementation complexity. Whenever a new version of a vocabulary is used, an application can use an alternative definition for the previous version that maps it to the new version.

Example: let’s define an extremely limited BARK with just manifests and SHA-3:

( verifiable-ns 0x1800 {id} nil "bark alpha"
"This vocabulary can be used to represent manifests."

( mnemonic/def nil "description" "( event {properties} )"