Subj : Re: Block based virtual harddisk library
To   : comp.programming
From : Ed Prochak
Date : Thu Sep 15 2005 03:01 pm

M.Barren@gmail.com wrote:
> Hi,
>
> I'm trying to implement a library that will provide an API to access a
> file as if it were a hard disk with a fixed capacity of 281474976710656
> (the max. of a 48-bit uint) blocks of a specific size (e.g. 100 B - 4 KB).
> Access to the virtual disk is block based (not byte based).
>
> Since the size of this virtual disk is beyond the physical limit of my
> disk, I can only afford to save the useful data (e.g. non-zero) to an
> actual file (container) on disk.
>
> **The only thing that I'm particularly trying to achieve is to make it
> as storage efficient as possible. At this point, fragmentation (of
> consecutive blocks in the virtual disk) is not a concern.**
>
> So what I've come up with so far is:
>
> Keep a table of pointers that map (non-zero) blocks of the virtual
> disk to blocks within the container file. Every time the container is
> opened by the library, the table is loaded into a hashtable (or a trie
> structure) for fast search. On each access to a block of the virtual
> disk, the hashtable is searched, and a match gives the address of the
> corresponding block in the container. Each entry in the table takes up
> 12 bytes (6 for the block address on the virtual disk, 6 for the block
> address in the container). Since hashtable and trie structures carry
> extra data per entry, memory can become a problem with a large number
> of blocks (1 million+).
>
> With 1000 blocks written to the container file, one might overwrite
> the first block with zeros, which would then qualify that block as
> useless. It then needs to be removed, but since it resides at the
> beginning of the file, we cannot just move all the later blocks back
> to fill its space. Hence, another table is needed to keep pointers to
> useless (all-zero) blocks.
> So on each new block allocation in the container file, the all-zero
> blocks will be reused before extending the container file. Each entry
> in this table will take up 6 bytes.
>
> I definitely need some help/advice on the memory problem that the
> address table will cause.
>
> If you have any ideas that you think I need to know before starting
> off, please let me know. Maybe you can direct me to some other similar
> implementations or papers that would help me understand the problem
> better.
>
> Michael
> (Excuse my verbose way of writing. English is my 2nd language, so I am
> not yet able to compress my sentences sufficiently.)

This seems closely related to the virtual memory problem of operating
systems. Maybe you can get some ideas from searching that topic?

Meanwhile, consider the extreme case: ALL of your physical space is
allocated. You need enough entries in your structure to handle that,
and your "free" structure is empty. If you have enough memory to handle
that case... hmmm. You are promising access to 256 TERA blocks! Yikes,
that is BIG!

Are you really sure you need to keep ALL the information in memory (in
a hash table or trie)? Can you keep at least part of it on the disc?
Block chain pointers, and parts of the hash table itself, stored out on
the disc can help.

In fact this solves your free block problem. All you need in memory is
the location of the first free physical disc block. Since these blocks
are free, you can store free-list pointers in the blocks themselves,
chaining them together. When the API frees a block, you insert it at
the head of the free list. When the API allocates a block, you return
the current head of the list and set the new head to the next one on
the list. So you do NOT need to keep the list of ALL free blocks in
memory.

I still have no easy solution for the allocated block list. With 6+6
bytes per block ID, you need 3072 terabytes if all blocks are
allocated. That doesn't leave much room in memory for program code.
8^)

Let us know how you are coming along. I'd love to hear more about this.

HTH,
ed