

Serialize Specification:

Problem:

cPickle and pickle use an inefficient serialization algorithm that results in both slow encoding/decoding and bloated streams. The algorithm uses the simplest possible implementation in order to avoid complexities such as endian byte ordering when sending pickled objects from machine to machine across a network.

Solution:

	Use an algorithm that is more efficient for transmitting serialized Python objects.

Build instructions:

	Edit Makefile for Python paths
	Then type 'make'

	Running benchmark.py will print statistics when compared to cPickle.

Algorithms:

Each data type is generally encoded into a packet. The packet begins with a header indicating what it is (a string, float, int, etc.); the rest of the packet is the value.
Integer values that are small enough can be encoded in the header itself, using the bits that are not assigned to indicate a type. Integers and floating point values can also be encoded in fewer bytes by keeping a mask of bits in the header in which each bit corresponds to a non-zero byte. Thus a number like
0x1000004A can be encoded into two bytes like this:

mask in the header:  1 0 0 1 binary, followed by 0x10 and 0x4A

During decoding we first set all bytes of the long to zero, then look at the mask.
Each set bit means we read one byte after the header and store it in the corresponding
byte of the long value.
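
The mask scheme above can be sketched in Python 3 as follows. This is only an illustration, not the module's actual code: the function names are hypothetical, the value is assumed to be an unsigned 32-bit integer, and the leftmost mask bit is assumed to stand for the most significant byte, which matches the 0x1000004A example.

```python
def encode_masked_u32(value):
    """Return (mask, payload) for an unsigned 32-bit integer.

    Assumption: the highest of the four mask bits describes the most
    significant byte of the value.
    """
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits")
    mask = 0
    payload = bytearray()
    for shift in (24, 16, 8, 0):           # most significant byte first
        byte = (value >> shift) & 0xFF
        mask <<= 1
        if byte:                           # only non-zero bytes are sent
            mask |= 1
            payload.append(byte)
    return mask, bytes(payload)


def decode_masked_u32(mask, payload):
    """Rebuild the integer from its 4-bit mask and its non-zero bytes."""
    value = 0                              # start with all bytes zero
    remaining = iter(payload)
    for bit in (0b1000, 0b0100, 0b0010, 0b0001):
        value <<= 8
        if mask & bit:                     # set bit: one byte follows
            value |= next(remaining)
    return value
```

For 0x1000004A this yields the mask 1 0 0 1 and the two payload bytes 0x10 and 0x4A, and decoding them gives back the original value.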

The same algorithm can be applied to double floating point values to encode them in
as few bytes as possible, only in this case a whole byte is needed for the mask. Note that the algorithm for the double needs to be smart enough to know that if more than 6 non-zero bytes are needed to store an 8 byte double value, then this scheme should not be used, since no savings can come from it: 7 data bytes plus the mask byte already equal the 8 bytes of the raw value.
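
A minimal Python 3 sketch of the double case, under stated assumptions: the function names are hypothetical, the double is laid out as big-endian IEEE 754 via struct, and the leftmost mask bit stands for the first byte. How the header tells the decoder that a raw 8 byte value follows instead is not described here, so the encoder simply returns None in that case.

```python
import struct


def encode_masked_double(value):
    """Return mask byte + non-zero bytes, or None if nothing is saved."""
    raw = struct.pack(">d", value)         # 8 bytes, big-endian IEEE 754
    nonzero = [b for b in raw if b]
    if len(nonzero) > 6:                   # 7 or 8 data bytes: no savings
        return None                        # caller would send the raw bytes
    mask = 0
    for b in raw:                          # one mask bit per byte
        mask = (mask << 1) | (1 if b else 0)
    return bytes([mask]) + bytes(nonzero)


def decode_masked_double(data):
    """Inverse of encode_masked_double for the masked form."""
    mask = data[0]
    remaining = iter(data[1:])
    raw = bytes(next(remaining) if mask & (0x80 >> i) else 0
                for i in range(8))
    return struct.unpack(">d", raw)[0]
```

With this layout 1.0 (0x3FF0000000000000) takes three bytes in total, while a value such as 0.1, whose eight bytes are all non-zero, falls back to the raw form.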

At the head of each packet should be an indicator of the endianness of the host machine.
This allows for byte reordering when pickled data is transported between big and little endian machines.
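
One way such an indicator could work is sketched below in Python 3. The flag values, the use of a whole byte for the flag, and the function names are assumptions made for the example; the specification does not say where in the header the indicator lives. The sender writes in its native order and the receiver picks the matching struct format.

```python
import struct
import sys

LITTLE, BIG = 0, 1                         # hypothetical flag values


def host_flag():
    """Flag describing the byte order of the machine running this code."""
    return LITTLE if sys.byteorder == "little" else BIG


def pack_u32(value):
    """Write the flag, then the value in the sender's native byte order."""
    return bytes([host_flag()]) + struct.pack("=I", value)


def unpack_u32(data):
    """Read the flag and reorder the bytes only if the sender differs."""
    fmt = "<I" if data[0] == LITTLE else ">I"
    return struct.unpack(fmt, data[1:5])[0]
```

The sender never swaps bytes; only a receiver with a different byte order pays the reordering cost.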




Data type headers:

Integer:

8 7 6 5  4 3 2 1
1 x x x  x x x x

Bit 7 set indicates that bits 1-6 contain the actual value of the integer, packed into the header.

Bit 6 set and bit 7 clear indicates that bits 1-4 are used as a four bit mask (one bit per byte of the integer) to encode the non-zero bytes as described in the algorithm section above.
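
The two integer header forms can be sketched in Python 3 as shown here. The constant and function names are hypothetical, bit 8 is taken as the integer type marker as in the diagram, and only unsigned 32-bit values are handled because the text does not say how negative numbers or the unused bit 5 are treated.

```python
INT_TAG = 0x80      # bit 8: packet is an integer
INT_INLINE = 0x40   # bit 7: value is stored in bits 1-6
INT_MASKED = 0x20   # bit 6: bits 1-4 hold the non-zero byte mask


def encode_int(value):
    """Encode an unsigned 32-bit integer (negative values not covered)."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("only unsigned 32-bit values are handled here")
    if value < 64:                          # fits in bits 1-6
        return bytes([INT_TAG | INT_INLINE | value])
    mask = 0
    payload = bytearray()
    for shift in (24, 16, 8, 0):            # most significant byte first
        byte = (value >> shift) & 0xFF
        mask = (mask << 1) | (1 if byte else 0)
        if byte:
            payload.append(byte)
    return bytes([INT_TAG | INT_MASKED | mask]) + bytes(payload)


def decode_int(data):
    """Decode a packet produced by encode_int."""
    header = data[0]
    if not header & INT_TAG:
        raise ValueError("not an integer packet")
    if header & INT_INLINE:
        return header & 0x3F                # value lives in bits 1-6
    if not header & INT_MASKED:
        raise ValueError("header form not described in the specification")
    value = 0
    pos = 1
    for bit in (0x08, 0x04, 0x02, 0x01):
        value <<= 8
        if header & bit:                    # set bit: read one data byte
            value |= data[pos]
            pos += 1
    return value
```

Under these assumptions the value 5 becomes the single byte 0xC5, and 0x1000004A becomes the header 0xA9 followed by 0x10 and 0x4A.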

String

8 7 6 5  4 3 2 1
0 1 x x  x x x x

	Bit 6 indicates whether or not the length is contained in bits 1-5.
	If clear, then an encoded integer follows containing the length.
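
A small Python 3 sketch of how the string header byte could be built. The names are hypothetical, and the function only decides which header form applies; writing the length packet and the string bytes themselves is left out.

```python
STR_TAG = 0x40          # bit 7 set with bit 8 clear: packet is a string
STR_INLINE_LEN = 0x20   # bit 6: length is stored in bits 1-5


def string_header(length):
    """Return (header byte, True if an encoded integer length must follow)."""
    if length < 0:
        raise ValueError("length cannot be negative")
    if length <= 0x1F:                      # 0..31 fits in bits 1-5
        return STR_TAG | STR_INLINE_LEN | length, False
    return STR_TAG, True                    # length sent as its own packet
```

A 5 character string therefore gets the header 0x65 with no extra length packet, while a 32 character string gets the header 0x40 followed by an encoded integer.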

List

8 7 6 5  4 3 2 1
0 0 1 x  x x x x

	Bit 5 set indicates that the length is in bits 1-4. If it is clear, then an encoded integer
	follows containing the length.
		         	
Tuple

8 7 6 5  4 3 2 1
0 0 0 1  x x x x

	Bit 4 set indicates that the length is in bits 1-3. If it is clear, then an encoded integer
	follows containing the length.
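
The string, list and tuple headers follow one pattern: the highest set bit names the type, the next bit down says whether the length is inline, and the remaining low bits hold the length. A decoder could exploit this as in the Python 3 sketch below. The names are hypothetical, and the dictionary pattern from the following section is included for type detection only.

```python
# (type name, tag bit) pairs, checked from bit 8 downwards.
TAGS = [("integer", 0x80), ("string", 0x40), ("list", 0x20),
        ("tuple", 0x10), ("dictionary", 0x08)]


def classify(header):
    """Return (type name, tag bit) for a header byte."""
    for name, tag in TAGS:
        if header & tag:                    # first set bit decides the type
            return name, tag
    raise ValueError("unknown header byte: 0x%02x" % header)


def inline_length(header):
    """Length stored in a string, list or tuple header.

    Returns None when the flag bit is clear, meaning an encoded
    integer holding the length follows the header.
    """
    name, tag = classify(header)
    if name not in ("string", "list", "tuple"):
        raise ValueError("no length rule known for type: " + name)
    flag = tag >> 1                         # bit just below the tag bit
    if header & flag:
        return header & (flag - 1)          # all bits below the flag
    return None
```

For example, the header 0x33 is a list of length 3, 0x1A is a tuple of length 2, and 0x20 is a list whose length follows as an encoded integer.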

Dictionary

8 7 6 5  4 3 2 1
0 0 0 0  1 x x x

	If bits 








  


