https://github.com/pyrustic/paradict
Latest commit: 6e9a41c (December 18, 2023)
Paradict

Streamable multi-format serialization with schema

License: MIT
This project is part of the Pyrustic Open Ecosystem.

Table of contents

* Overview
* Paradict textual format: Why not JSON, YAML, or TOML ?
* Paradict binary format: Why not Protobuf, MessagePack, or CBOR ?
* Code snippets for everyday scenarios
* Paradict datatypes
* Data format specification
* Application programming interface
  + Textual serialization
  + Binary serialization
  + Type customization
* Continuous data stream processing
* Paradict schema for data validation
* Miscellaneous
* Testing and contributing
* Installation

Overview

Paradict is a multi-format serialization solution for serializing and deserializing a dictionary data structure in bulk or in a streaming fashion. It comes with a data validation mechanism, an extension mechanism, and more, and its eponymous reference library is a Python package available on PyPI.

A rich set of datatypes

A Paradict dictionary can be populated with strings, binary data, integers, floats, complex numbers, booleans, dates, times, datetimes, comments, extension objects, and grids (matrices). Although Paradict's root data structure is a dictionary, lists, sets, and dictionaries can be nested within it at arbitrary depth.

An extension mechanism

Paradict has an extension mechanism that works with two components:

* extension object: a dictionary-based structure defined in Paradict data (in textual or binary format).
* object builder: a Python callable (passed to the deserializer) that takes an extension object as input, consumes its contents, then builds and returns a new Python object.

A multi-format solution

Paradict offers binary and textual representations for any compatible dictionary data structure. The human-readable format has two modes: a data mode for bidirectional mapping to the binary format, and a config mode, with lighter syntax, suitable for configuration files.

A validation mechanism

Data validation is performed against a schema which is itself just another dictionary.
The schema can be defined in a file with an arbitrary data format (Paradict, JSON, etc.) or programmatically. Basically, a schema describes the expected keys in the target dictionary and the expected data types of their values. When defined programmatically, the schema allows the programmer to validate the target dictionary with arbitrary rules by incorporating checker callbacks.

An intuitive API

The library API is designed to be simple to understand, intuitive, and powerful. There are four fundamental classes, Encoder, Decoder, Packer, and Unpacker, which serialize and deserialize data iteratively. On top of these classes, four functions, namely encode, decode, pack, and unpack, do the same thing but in bulk. Then there are additional classes and functions for various tasks, such as the TypeRef class for customizing types, the ConfigFile class for configuration files, and the load and dump functions for reading and writing Paradict binary files.

And more...

There's more to say about Paradict than can fit in this Overview section. In the following sections, we'll dig deeper into Paradict, but first, why not JSON, YAML, TOML, Protobuf, MessagePack, or CBOR ?

Paradict textual format: Why not JSON, YAML, or TOML ?

With its textual format, Paradict is a de facto alternative to JSON, YAML, and TOML. Although these three formats are all human-readable, they serve different purposes. For example, TOML is specifically designed for configuration files, while JSON is used as a data interchange format. Having two modes (data mode and config mode) for its textual format lets Paradict target the different purposes of JSON, YAML, and TOML. In addition to offering a binary representation of its textual format, Paradict avoids the kind of complexity and ambiguity found in YAML, and it provides an extension mechanism and a rich set of datatypes.

Paradict binary format: Why not Protobuf, MessagePack, or CBOR ?
With its binary format, Paradict is a de facto alternative to Protobuf, MessagePack, and CBOR. However, choosing a binary format requires careful consideration, as its strengths and weaknesses are not as readily discernible as those of a textual format. One might therefore expect this section to offer comprehensive benchmarks and comparisons of different serialization solutions. Nonetheless, given the potential bias of benchmarking toward a desired outcome, let us only point out that, unlike the others, Paradict provides bidirectional mapping between its textual and binary formats. The surge in LLM adoption is a reminder that people value advanced machine interfaces and intuitive data representation, despite extra compute costs.

Code snippets for everyday scenarios

Following are working code snippets for everyday scenarios.

Binary representation of data

Pack and unpack:

    from paradict import pack, unpack

    my_dict = {0: 42}
    # serialize my_dict
    bin_data = pack(my_dict)
    # test
    assert my_dict == unpack(bin_data)

Read and write a file:

    from datetime import datetime
    from paradict import load, dump

    path = "/home/alex/test/user_card.bin"
    user_card = {"name": "alex", "id": 42, "group": "admin",
                 "birthday": datetime(2020, 1, 1, 4, 20, 59)}
    # serialize user_card then dump it into the file
    dump(user_card, path)
    # deserialize user_card from the file
    data = load(path)
    # test
    assert user_card == data

The code snippet above serializes the user_card dictionary, then dumps it into the file "user_card.bin".
The file would contain 43 bytes, which can be displayed as follows:

    from paradict import stringify_bin

    path = "/home/alex/test/user_card.bin"
    with open(path, "rb") as file:
        data = file.read()
    print(stringify_bin(data))

Output:

    \x01\x45\x6e\x61\x6d\x65\x45\x61\x6c\x65\x78\x43\x69\x64\xc5\x46\x67\x72\x6f\x75\x70\x46\x61\x64\x6d\x69\x6e\x49\x62\x69\x72\x74\x68\x64\x61\x79\x19\x9b\x2f\x2b\x3d\xa4\xff

Textual representation of data

Encode and decode:

    from paradict import encode, decode

    my_dict = {0: 42}
    # serialize my_dict
    txt_data = encode(my_dict)
    # test
    assert my_dict == decode(txt_data)

Working with config files

Create and interact with a configuration file:

    from datetime import datetime
    from paradict import ConfigFile, box

    path = "/home/alex/test/app_settings.dict"
    # data for the 'user' section
    user_card = {"name": "alex", "id": 42, "group": "admin",
                 "birthday": datetime(2020, 1, 1, 4, 20, 59)}
    # data for the 'gui' section
    gui_config = {box.CommentID(): box.Comment("Exotic fonts are banned !"),
                  "font_family": "Arial",
                  "background": "black",
                  "dimensions": {"width": 42.0, "height": 3.14}}
    confile = ConfigFile(path)
    confile.set("user", user_card)
    confile.set("gui", gui_config)

    # few hours later...
    confile = ConfigFile(path)
    # test
    assert user_card == confile.get("user")

The code snippet above will create a config file then fill it with:

    [user]
    name = "alex"
    id = 42
    group = "admin"
    birthday = 2020-01-01T04:20:59

    [gui]
    # Exotic fonts are banned !
    font_family = "Arial"
    background = "black"
    dimensions = (dict)
        width = 42.0
        height = 3.14

Paradict datatypes

Following are Paradict datatypes for both textual and binary formats:

* dict: dictionary data structure
* list: list data structure
* set: set data structure
* obj: object type for extension
* grid: grid data structure for storing matrix-like data
* bool: boolean type (true and false)
* str: string type with unicode escape sequences support
* raw: raw string without unicode escape sequences support
* comment: comment datatype
* bin: binary datatype
* int: integer datatype
* float: float datatype
* complex: complex number
* datetime: ISO 8601 datetime (with time offsets)
* date: ISO 8601 date
* time: ISO 8601 time (with time offsets)

Paradict supports null for representing the intentional absence of any value.

For the dictionary data structure, Paradict allows keys to be either strings or numbers. However, in the config mode of the textual format, keys should only be strings. Also, a multiline string is tagged as (text) in the textual format, and it spans multiple lines.

Data format specification

This section is just an overview of the binary and the textual Paradict formats. For more information, consult txt_paradict_spec.md and bin_paradict_spec.md.

Textual format

At the high level of the textual representation is the message, which represents a dictionary data structure, and at the low level is the line of text. A line of text can represent either complete data, such as a number, or a portion of some data that spans multiple lines, such as a multiline string. For human readability, data expected to span multiple lines is first introduced with a tag (the data type in parentheses) under which the data is placed with the correct number of 4-space indents. The format comes with two modes, the data mode and the config mode.
These modes differ in the data types allowed for dictionary keys and in the character used to separate each key from its value.

Data mode

The data mode formally represents data (bidirectional mapping to binary format). It allows strings and numbers as keys and uses a colon as the separator between a key and its value.

    # this is a comment
    "my key": "Hello World"

Config mode

The config mode is only for configuration files. It only allows strings as keys, removing the need to surround them with quotes, and it uses the equal sign as the separator between a key and its value.

    # this is a comment
    my_key = "Hello World"

Read the full specification in txt_paradict_spec.md !

Binary format

At the high level of the binary representation is the message, which represents a dictionary data structure, and at the low level is the datum, which is often a 2-tuple composed of a tag and its payload, which may be non-existent. The binary format is designed from scratch, so each datatype received careful attention in order to have a compact and coherent binary representation.

Read the full specification in bin_paradict_spec.md !

Application programming interface

The API exposes four foundational classes, Encoder, Decoder, Packer, and Unpacker, that serialize and deserialize data iteratively. On top of these classes, four functions, encode, decode, pack, and unpack, do the same thing but in bulk. Then there are additional classes and functions for various tasks, such as the TypeRef class for type customization, the ConfigFile class for configuration files, and the load and dump functions for reading and writing binary Paradict files.

Note that this section is just an overview of the API, so it doesn't replace the module documentation.

Explore the module documentation.

Textual serialization

Encoder and Decoder are the foundation classes for serializing and deserializing data. These classes process data iteratively.
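To make the two modes and the idea of iterative processing concrete, here is a minimal pure-Python sketch. It does not use the paradict library and is not how Encoder is implemented: `iter_lines` is a hypothetical helper that only handles a flat dictionary of strings and numbers, and it borrows `json.dumps` for quoting as a simplifying assumption.

```python
import json

def iter_lines(data, config_mode=False):
    """Toy generator (hypothetical, not paradict's Encoder):
    yield one key-value line at a time for a flat dictionary."""
    for key, value in data.items():
        if config_mode:
            # config mode: bare string key, equal sign as separator
            yield f"{key} = {json.dumps(value)}"
        else:
            # data mode: quoted string key, colon as separator
            yield f"{json.dumps(key)}: {json.dumps(value)}"

data = {"id": 42, "name": "alex"}

# Iterative use: consume the lines one by one as they are produced
data_lines = list(iter_lines(data))

# Bulk use: join every line into a single string
config_text = "\n".join(iter_lines(data, config_mode=True))

print(data_lines)
print(config_text)
```

Because the lines are produced lazily, a consumer can forward each one as soon as it is available instead of waiting for the whole document.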
On top of these classes, two functions, encode and decode, do the same thing but in bulk. Three additional classes, Document, FileDoc, and ConfigFile, offer a document abstraction for interacting with serialized data. A document can be divided into sections, each having a header and a dictionary body. Under the hood, the document abstraction uses Braq.

Using the Encoder class

The Encoder constructor accepts mode, type_ref, skip_comments, and skip_bin_data as arguments. The encode method of this class takes a Python dictionary as input, then iteratively serializes it, yielding one line after another.

    from paradict import Encoder

    data = {"id": 42, "name": "alex"}
    encoder = Encoder()  # mode=const.DATA_MODE
    lines = list()
    for r in encoder.encode(data):
        lines.append(r)
    print("\n".join(lines))

Output:

    "id": 42
    "name": "alex"

The same code but with the constructor parameter mode set to const.CONFIG_MODE would output:

    id = 42
    name = "alex"

Using the Decoder class

The Decoder constructor accepts type_ref, receiver, obj_builder, and skip_comments as arguments. The feed method of this class takes as input a multiline string that represents the data to deserialize. This string can also be fed to the deserializer line by line.

    from paradict import Decoder

    text = 'id = 42\nname = "alex"'
    decoder = Decoder()
    decoder.feed(text)
    if decoder.queue.buffer:
        decoder.feed("\n")
    decoder.feed("===\n")  # end of stream
    data = decoder.data
    print(type(data))
    print(data)

Output:

    <class 'dict'>
    {'id': 42, 'name': 'alex'}

Using the encode function

The encode function accepts data, mode, type_ref, skip_comments, and skip_bin_data as arguments.

    from paradict import encode, const

    data = {"id": 42, "name": "alex"}
    # DATA MODE
    r = encode(data)  # mode==const.DATA_MODE
    print("DATA MODE")
    print(r)
    # CONFIG MODE
    r = encode(data, mode=const.CONFIG_MODE)
    print("\nCONFIG MODE")
    print(r)

Output:

    DATA MODE
    "id": 42
    "name": "alex"

    CONFIG MODE
    id = 42
    name = "alex"

Using the decode function

The decode function accepts type_ref, receiver, obj_builder, and skip_comments as arguments.

    from paradict import decode

    # for the sake of the example,
    # the 'id' key-value line follows the DATA mode
    # and the 'name' key-value line follows the CONFIG mode
    data = """\
    "id": 42
    name = "alex"
    """
    r = decode(data)
    print(r)

Output:

    {'id': 42, 'name': 'alex'}

Using the document abstraction

The ConfigFile class is based on the FileDoc class, which itself inherits from the Document class.

Using the Document class

The Document class offers a document abstraction made of sections. Thus, via the get and set methods, one can read and write the body of a specific section by providing its header. The Document class has a built-in data validation mechanism, as it accepts a schema as an optional constructor argument. Following are the methods exposed by the Document class: get, set, check, render, load_schema, validate, load_from, save_to, remove, and clear.

    from paradict import Document

    init_text = """\
    id = 42
    name = "alex"

    [misc]
    project = "paradict"
    is_open_source = True
    """
    doc = Document(init_text)
    # the default header argument is an empty string
    assert doc.get("") == {"id": 42, "name": "alex"}
    # get the misc section
    assert doc.get("misc") == {"project": "paradict", "is_open_source": True}
    # edit the body of the misc section
    doc.set("misc", {"pi": 3.14})
    assert doc.get("misc") == {"pi": 3.14}

Using the FileDoc class

The FileDoc class works like the Document class, except that it is linked to an actual file. This class also has three additional methods: update, load, and save.

    from paradict import FileDoc

    path = "/home/alex/file_doc.dict"
    file_doc = FileDoc(path)
    file_doc.set("", {"key": "value"})

Using the ConfigFile class

The ConfigFile class works like the FileDoc class, except that it is linked to an actual configuration file, so its encoding mode is implicitly set to CONFIG_MODE.

    from paradict import ConfigFile

    path = "/home/alex/app_config.dict"
    file_doc = ConfigFile(path)
    file_doc.get("section", skip_comments=False)  # by default, comments are skipped

Miscellaneous functions

Under the hood, the Deserializer class uses a public function for splitting a key-value line into three parts:

* the key,
* the value,
* and the separator character.

    from paradict import split_kv

    key_val = "my_key = 'my value'"
    info = split_kv(key_val)
    # info is a namedtuple containing
    # the key, the value, the separator char
    # which is either a colon ':', or an
    # equal '=', and also the mode which is either
    # const.CONFIG_MODE or const.DATA_MODE
    key, val, sep, mode = info

Binary serialization

Packer and Unpacker are the foundation classes for serializing and deserializing data. These classes process data iteratively and, on top of them, two functions, pack and unpack, do the same thing but in bulk. Two additional functions, load and dump, read and write binary files.

Using the Packer class

The Packer constructor accepts type_ref and skip_comments as arguments. The pack method of this class takes a Python dictionary as input, then iteratively serializes it, yielding one binary datum (or part of one) after another.

    from paradict import Packer, stringify_bin

    data = {"id": 42, "name": "alex"}
    packer = Packer()
    buffer = bytearray()
    for d in packer.pack(data):
        buffer.extend(d)
    print(stringify_bin(buffer))

Output:

    \x01\x43\x69\x64\xc5\x45\x6e\x61\x6d\x65\x45\x61\x6c\x65\x78\xff

Using the Unpacker class

The Unpacker constructor accepts type_ref, receiver, obj_builder, and skip_comments as arguments.
The feed method of this class takes as input some binary data that represents the data to deserialize. This binary data can be fed to the deserializer in small chunks.

    from paradict import pack, Unpacker

    data = {"id": 42, "name": "alex"}
    d = pack(data)
    unpacker = Unpacker()
    unpacker.feed(d)
    assert unpacker.data == data

Using the pack function

The pack function accepts data, type_ref, and skip_comments as arguments.

    from paradict import pack, stringify_bin

    data = {"id": 42, "name": "alex"}
    # serialize in bulk
    r = pack(data)
    print(stringify_bin(r))

Output:

    \x01\x43\x69\x64\xc5\x45\x6e\x61\x6d\x65\x45\x61\x6c\x65\x78\xff

Using the unpack function

The unpack function accepts raw, type_ref, receiver, obj_builder, and skip_comments as arguments.

    from paradict import pack, unpack

    data = {"id": 42, "name": "alex"}
    d = pack(data)
    r = unpack(d)
    assert data == r

Load and dump

    from paradict import dump, load

    path = "/home/alex/user_card.bin"
    data = {"id": 42, "name": "alex"}
    # Serialize and write data to user_card.bin
    dump(data, path)
    # Read and deserialize data
    r = load(path)
    # test
    assert data == r

Miscellaneous functions

The library exposes some public miscellaneous functions to play with binary data:

* the forge_bin function, which generates a bytearray forged from the provided arguments, which can be bytes, bytearrays, or integers,
* the stringify_bin function, which returns the hexadecimal string representation of the binary data given as argument.

    from paradict import stringify_bin, forge_bin

    args = (b'\x01', b'\x02', None, 3)
    r = forge_bin(*args)
    print(stringify_bin(r))

Output:

    \x01\x02\x03

Type customization

The classes and functions for (de)serializing data all accept an instance of TypeRef. TypeRef is the class at the core of the type customization mechanism.
For example, one might want to use Python's OrderedDict instead of the regular dict:

    from collections import OrderedDict
    from paradict import TypeRef, decode

    data = """\
    pi = 3.14
    user = (dict)
        id = 42
        name = "alex"
    """
    type_ref = TypeRef(dict_type=OrderedDict)
    r = decode(data, type_ref=type_ref)
    assert type(r) is OrderedDict
    assert type(r["user"]) is OrderedDict
    assert r == {"pi": 3.14, "user": {"id": 42, "name": "alex"}}

Also with TypeRef, one could adapt some exotic datatype so that it conforms with the Python datatypes allowed for serialization:

    from paradict import TypeRef, encode

    class CapitalizedString(str):  # an exotic type
        pass

    type_adapter = lambda s: s.capitalize()
    adapters = {CapitalizedString: type_adapter}
    type_ref = TypeRef(adapters=adapters)
    data = {"name": CapitalizedString("alex")}
    r = encode(data, type_ref=type_ref)
    print(r)

Output:

    "name": "Alex"

Continuous data stream processing

Paradict supports both textual and binary continuous data stream processing.

Textual stream

Following is a heavily commented code snippet for performing continuous data stream processing:

    from paradict.serializer.encoder import Encoder
    from paradict.deserializer.decoder import Decoder

    # This stream is made of messages
    # Each message is a dictionary that serves as envelope
    stream = [{0: "a"}, {0: "b"}, {0: "c"}]
    # Result will hold the decoded messages
    result = list()
    # instantiate encoder and decoder
    encoder = Encoder()
    # the receiver takes as argument the reference to the decoder
    decoder = Decoder(receiver=lambda ref: result.append(ref.data))
    # iterate over the stream to encode each message into lines
    # that will feed the decoder which will call the receiver
    # after each complete decoding of a message.
    # The decoder holds a reference to the latest
    # decoded message via the "decoder.data" property
    for msg in stream:
        for line in encoder.encode(msg):
            decoder.feed(line + "\n")
        decoder.feed("===\n")  # end of message
        # check if the message is well decoded
        assert msg == decoder.data  # decoder.data holds decoded data
    # check if the original stream contents is mirrored in
    # the result variable
    assert stream == result

Binary stream

Following is a heavily commented code snippet for performing continuous data stream processing:

    from paradict.serializer.packer import Packer
    from paradict.deserializer.unpacker import Unpacker

    # This stream is made of messages
    # Each message is a dictionary that serves as envelope
    stream = [{0: "a"}, {0: "b"}, {0: "c"}]
    # Result will hold the unpacked messages
    result = list()
    # instantiate packer and unpacker
    packer = Packer()
    # the receiver takes as argument the reference to the unpacker
    unpacker = Unpacker(receiver=lambda ref: result.append(ref.data))
    # iterate over the stream to pack each message into datums
    # that will feed the unpacker which will call the receiver
    # after each complete unpacking of a message.
    # The unpacker holds a reference to the latest
    # unpacked message via the "unpacker.data" property
    for msg in stream:
        for datum in packer.pack(msg):
            unpacker.feed(datum)
        # check if the message is well unpacked
        assert msg == unpacker.data  # unpacker.data holds unpacked data
    # check if the original stream contents is mirrored in
    # the result variable
    assert stream == result

Paradict schema for data validation

A Paradict schema is a dictionary containing specs for data validation. A spec is either simply a string that represents an expected data type, or a Spec object that can contain a checking function for complex validation.
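As a rough illustration of the idea, the following pure-Python sketch checks a dictionary against a schema whose specs are either type-name strings or objects carrying a checker callback. It does not use or reproduce paradict's validator: `ToySpec`, `TYPE_NAMES`, and `toy_validate` are hypothetical names, only a handful of type names are handled, and returning False on any mismatch is an assumption made for this sketch.

```python
from typing import Any, Callable, NamedTuple, Optional

class ToySpec(NamedTuple):
    """Hypothetical stand-in for a spec with an optional checker callback."""
    type_name: str
    checker: Optional[Callable[[Any], bool]] = None

# Small subset of type names, chosen for this sketch only
TYPE_NAMES = {"int": int, "float": float, "str": str, "bool": bool}

def toy_validate(data: dict, schema: dict) -> bool:
    """Return True if every key in the schema exists in data with the
    expected type and passes its optional checker (toy logic only)."""
    for key, spec in schema.items():
        if key not in data:
            return False
        if isinstance(spec, str):
            # a bare string spec only names the expected type
            spec = ToySpec(spec)
        value = data[key]
        # exact type match, so that True is not accepted as an int
        if type(value) is not TYPE_NAMES[spec.type_name]:
            return False
        if spec.checker is not None and not spec.checker(value):
            return False
    return True

schema = {"id": ToySpec("int", lambda x: 40 < x < 50), "name": "str"}
ok = toy_validate({"id": 42, "name": "alex"}, schema)
out_of_range = toy_validate({"id": 60, "name": "alex"}, schema)
print(ok, out_of_range)
```

Keeping the schema as a plain dictionary means it can be written by hand, loaded from a file, or built programmatically when checker callbacks are needed.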
Supported spec strings are: dict, list, set, obj, bin, bool, complex, date, datetime, float, grid, int, str, time

Code snippet:

    from paradict import validate
    from paradict.validator import Spec

    # data
    data = {"id": 42, "name": "alex", "books": ["book 1", "book 2"]}
    # schema
    schema = {"id": Spec("int", lambda x: 40 < x < 50),
              "name": "str",
              "books": ["str"]}
    assert validate(data, schema) is True

Miscellaneous

The beautiful cover image is generated with Carbon.

Testing and contributing

Feel free to open an issue to report a bug, suggest some changes, show some useful code snippets, or discuss anything related to this project. You can also directly email me.

Set up your development environment

Following are instructions to set up your development environment:

    # create and activate a virtual environment
    python -m venv venv
    source venv/bin/activate
    # clone the project then change into its directory
    git clone https://github.com/pyrustic/paradict.git
    cd paradict
    # install the package locally (editable mode)
    pip install -e .
    # run tests
    python -m unittest discover -f -s tests -t .
    # deactivate the virtual environment
    deactivate

Installation

Paradict is cross-platform. It is built on Ubuntu and should work on Python 3.5 or newer.

Create and activate a virtual environment

    python -m venv venv
    source venv/bin/activate

Install for the first time

    pip install paradict

Upgrade the package

    pip install paradict --upgrade --upgrade-strategy eager

Deactivate the virtual environment

    deactivate

About the author

Hello world, I'm Alex, a tech enthusiast ! Feel free to get in touch with me !