.if n .pH arch.mem_mgr @(#)mem_mgr	40.1

.ig
imaged dc 8/30/89
docformat -c -s4 -dqm4 fn
..
.\" turn off automatic hyphenation -- JHevelin (08-09-89)
.am RT
.nh
..
.ds S) Shapes
.BK "Shapes Internal Architecture"
.CH "Memory Management" "A"
.H 1 "Memory Management" 
.H 2 "Introduction"
.nh
This appendix discusses the preferred schemes for
memory management for X11/NeWS.
These schemes, HPA (Homogeneous Page Allocation)
and POM (Page-Oriented Malloc),
optimize different types of memory usage.
.H 2 "Goals"
.P
The topics covered in this appendix are listed
below in order of decreasing priority:
.BL
.LI
localization of related data onto ``hot'' pages
.LI
fast allocation/deallocation with minimal page ``hits''
.LI
minimal overhead for allocated blocks
.LI
simplicity of code interface
.LI
monitoring/debugging capabilities and ease of use
.LE
.P
Because of the differing localization needs of
various objects in the server,
at least two cooperating memory
management schemes will be used.
It is necessary for one mechanism
to be able to recognize and administer both
schemes to improve sharing of the free pages in
the system.
Each page obtained by the system will be assigned
to one of the memory management schemes until it
is no longer needed
(all objects are completely freed).
Once freed, it then goes into a common pool of
free pages to await assignment to another scheme.
.H 2 "Terminology"
.P
.B block
.P
A section of memory that has been given
to the program as a result of a memory allocation
request.
.P
.B "free block"
.P
A section of memory that is available to be
given to the program to satisfy an allocation request.
.P
.B "free page"
.P
A page available to be apportioned for free blocks.
It may be completely free of allocated blocks or
just have large spaces on it that are not part
of any allocated blocks.
Different algorithms may draw this distinction
differently.
.\" begin new page
.P
.B "hierarchical objects"
.P
An object that is linked together with objects
of the same type into some type of hierarchy.
The best example of this in NeWS is the canvas
hierarchy.
.P
.B "page"
.P
A large portion of the process address space
that contains blocks that are to be given out
to satisfy memory allocation requests.
The page could be the same size as the machine's
memory management page size,
but could easily be a multiple or fraction of a
machine page.
This inequality affects various algorithms
differently.
.P
.B "page fragmentation"
.P
A block allocated across machine pages unnecessarily
(i.e., begins and/or ends on some address that is
not on a machine page boundary).
Note that some algorithms would not be affected
much by allocating a multiple number of pages
(which would begin and end on the boundaries of
different pages),
but all of them would be less efficient if the
allocation pages did not fall neatly onto
page boundaries.
.P
.B "spaghetti objects"
.P
An object which points at several different objects
(and may be pointed to by several others) haphazardly.
A good example of this in NeWS is a PostScript dict,
which can contain a number of any other kind
of PostScript objects.
.H 2 "Homogeneous Page Allocation (HPA)"
HPA handles the localization needs of hierarchical objects.
Since these objects are always scanned in
hierarchies of objects of the same type,
they will be allocated from a set of pages
containing only blocks of the appropriate size
for these objects.
These objects share these pages only with other
objects of the same type.
Canvasses and events are included in this scheme.
Since the size of these objects is known ahead of time,
there is no per-block overhead for these objects.
.H 2 "Page Oriented Malloc (POM)"
.P
POM handles \f2spaghetti objects\fP,
related objects of different types,
and different sizes that point to each other.
In this scheme,
each set of related spaghetti objects is assigned
its own free-page pool.
Examples of spaghetti object groupings include
the following:
.BL
.LI
PostScript procedures
.LI
arrays read from a given file
.LI
graphics contexts,
transforms,
and paths for one NeWS process	
.LI
descriptors and data related to a font
.LE
.P
The per-block overhead for these objects is the
storage size of the object.
.H 2 "Additional Development"
.P
A third scheme could be developed for certain
usages in which memory blocks are permanently allocated.
An example of this would be the PostScript keywords,
which are not reference counted,
and therefore, once created, are never freed.
The implementation of this scheme would be
similar to the POM,
but would not require the overhead of storing the
size with the object.
Currently, the PostScript keywords are the only
such example,
and they are already being rewritten to use their
own private allocation scheme,
so this third scheme is only hypothetical.
.P
What should be done about requests that don't fit
into one of the above categories?
The default algorithm should be based on one of
the HPA or POM schemes.
The POM scheme can handle objects of indeterminate size,
but at the overhead of a variable-sized,
free-list mechanism.
The mechanism is designed to make the most of
scavenging adjacent free blocks of arbitrary size
into larger blocks to reduce dead space and allow
for freeing pages that are unused.
The HPA scheme has a much simpler free-block
handling scheme.
No attempt is made to coalesce adjacent free
blocks since that is forbidden by the design.
HPA requires devoting whole pages to each
block-size to be supported.
.P
The solution is to use both schemes in the
default case,
each where it is more appropriate.
Objects for which no pool is specified are
allocated from a single global default POM pool
if they exceed the threshold,
and from any of a number of HPA pools containing
objects of frequently-allocated sizes if they are
smaller than the threshold.
The choice of threshold and the distribution of
sizes for the default HPA pools is subject to
empirical considerations and needs to be tuned
for the running server.
.P
Blocks of extremely large size (greater than 1 page)
cannot be easily handled by either POM or HPA,
so an additional scheme must be used.
More data needs to be collected about objects of
extremely large sizes,
but use of the vmstatus operator shows that such
blocks account for at least half of memory usage
on a monochrome server
(due to retained canvases).
This mechanism must share its free pages with the
HPA and POM schemes.
The HPA and POM schemes obtain new pages by
making requests for 1-page chunks from this
underlying Page Allocator.
These schemes free their pages back to the PA for
integration with other free pages.
Requests for blocks larger than one page are sent
directly to the PA.
.H 2 "Design Details"
There is a prototypical implementation of this system:
.H 2 "Page Allocator"
.P
Handles alloc/free requests for multiples of
pagesize only.
It maintains a global \f2page descriptor\fP table
that includes information about each page:
.BL
.LI
which scheme is using it (PA, HPA, POM)
.LI
which pool it is assigned to
.LI
size information for HPA pages
.LI
free list linkages for POM pages
.LI
number of contiguous pages for huge PA blocks.
.LE
.H 2 "Homogeneous Page Allocation"
.P
Handles objects of the same size coming from a
pool devoted to that size.
Information about the size of the objects is kept
in the pool descriptor.
New pages to expand a filled pool and empty pages
that can be removed from the pool are handled by PA.
.H 2 "Page-Oriented Malloc"
.P
Handles objects of variable size coming from a
pool devoted to a set of objects by the server.
Information about the complex free lists is kept
in the page descriptor table maintained by PA.
New pages to expand a filled pool and empty pages
that can be removed from the pool are handled by PA.
.P
The base memory-allocation function takes a pool
descriptor and a size as arguments.
For HPA, the size is ignored
(or possibly checked against the recorded size
for that pool),
and a block is allocated from the free list
for that pool.
For POM, a block of the requested size is found
from the free lists of the pages assigned to the
passed pool.
For a generic request,
the pool descriptor is NULL,
and a block of the requested size is obtained
from a single global POM list
(for relatively large objects)
or from one of several tuned HPA lists
(for small objects).
If the size is above the HPA/POM threshold of one page,
the pool descriptor is ignored and the block
is allocated from the PA.
The page descriptors for the returned pages
(must be contiguous)
are marked as being part of one large set of
contiguous pages.
.P
The base memory-freeing function takes an address only.
The corresponding page descriptor is looked up in
the global descriptor list.
If the page is an HPA page,
the block is placed on the free list for that page.
If the page is a POM page,
the block is merged into the free list for that page
(coalescing with adjacent free blocks).
If the page is a PA page,
all of the corresponding pages are freed to the
free page pool.
If a POM or HPA page is completely free
(as determined by a count of free space in the
page descriptor),
the page is released to the free page pool.
.P
A new HPA page is requested from the PA when the
current allocating page becomes full and there is
no page on the pool descriptor's partially-free
page list.
The new page is immediately broken up into a free
list of fixed-size objects.
.P
A new POM page is requested from the PA when the
current allocating page cannot support a given request,
and a search through the page descriptors
of the pages assigned to that pool shows no pages
able to support the request.
The new page is broken up into one block of a
size of the whole page,
which becomes that page's free list.
.H 2 "Use of the Proposed Scheme"
.BL
.LI
One PA pool maintains the list of all free pages
in the system.
.LI
A few (initially none until data is obtained)
global HPA pools satisfy requests for which no
pool was specified and which are close to a few
frequently requested sizes.
.LI
One global POM pool satisfies requests for which
no pool was specified and a decent HPA match
could not be found.
.LI
A global Shapes pool that can be set from the
server to control the locality of the objects
that Shapes allocates on its behalf.
This should be a POM pool.
.LI
A special \f2hot\fP POM pool for allocation of short-lived,
growable Shapes objects.
.LI
One HPA pool is for events.
.LI
For each root canvas:
.BL
.LI
One HPA pool for rooted canvases.
.LI
A POM pool for data related to the rooted canvases,
so that the path, context, transform,
and raster objects can be obtained from Shapes.
Also in this pool would be the scene and screen structures.
.LE
.LI
A POM pool per process containing the process body,
its stacks, all GC data associated with it,
and any dicts, strings, or arrays it creates.
.LI
A POM pool per defined font.
.LI
The fonts read from files generating huge
requests and obtaining their own pages can be
allocated from the global pools.
.LI
Either of the following exist:
.BL
.LI
A POM pool per executed file for the PostScript
arrays in it.
This method allows user-tuneable PostScript code
packaging.
.LI
An HPA pool for all executable array bodies,
and a POM pool for all executable array object arrays.
This method allows promoting hot executable arrays
(object array only, not the multi-referenced body).
The keywords are being reworked to call the PA
directly for their pages.
.LE
.LE
.P
Functional overrides for qalloc and QALLOC,
as well as malloc, still exist,
and cause allocation from the global
(i.e., no pool specified) pools,
so that the above can be implemented for now.
Any further needs can be implemented as needed.
.H 2 "Review of Schemes"
.P

.H 2 "Homogeneous Pages Allocation"
.P
Requested blocks are allocated from pages
containing only blocks of the same size.
Freed blocks are put on the free list for their
own page.
HPA pools  exist either for the programmer to
localize a given type of object
(ideal for hierarchical objects)
or for a particular size that appears to be
requested frequently
(such choices would have to be tuned).
.P
The free list for a given pool points at the free
list for one of the pages assigned to that pool.
When that page's free list runs out of free blocks,
a new page is found and its free list is used.
Various algorithms for selecting the next hot
allocation page could be dropped in such as:
.BL
.LI
Page with lowest address
.P
Compacts blocks towards bottom of memory.
.LI
Page with most free space
.P
Reduces time spent looking for next hot page.
Increases locality of blocks allocated within
a short span of time.
.LI
Page with least free space
.P
Allows nearly free pages to wither away
and return to the free page pool.
Puts blocks near large accumulations of other
nonfree blocks.
.LE
.P
Pages that are assigned to a bucket are immediately
split up into as many like-sized objects as possible
and put on that page's free list
(which should also be the pool's current free
list).
Fewer calls to the allocation function are needed,
since each call results in a large number of
blocks available to the macros immediately,
although the inherent re-configuration problems
with macros may warrant their elimination.
The information about the size of the objects on
a page is kept in the pool descriptor.
.P
Specific needs for given hierarchies of
similarly-sized objects to be kept hot could be satisfied.
Examples of such are canvases and events,
which are usually involved in incestuous accesses
(canvas tree searches and event-interest compares).
These objects are also involved in access
attempts that are critical to the ``perceived
response time'' of the server
(cursor movement, event distribution).
.H 2 "Pros and Cons"
.BL
.LI
Quick allocation of a block \(em same as qalloc or better.
.LI
Excellent localization depending on the choice of
private pools by developers.
.LI
Fair/poor localization afforded to generic block
requests,
but probably better than QALLOC.
This includes spaghetti objects such as PostScript
dicts/arrays.
.LI
Excellent use of available space on ``free pages''.
.LI
Free lists expanded by quantum chunks could result
in many more blocks waiting around on free lists.
Tuning could reduce this overhead.
.LI
Excellent page fragmentation integral to the design.
.LI
Very small overhead for the allocated blocks.
Most objects are quite small
(but account for about 25% of all allocations),
which increases the cost of any overhead.
.LE
.H 2 "Page Oriented Malloc"
.P
Maintain a per-page free list as in the Homogeneous
Pages scheme and maintain a hierarchy of pages in a
global free pages list.
The global free pages list would link the pages
in such a way as to allow a search for a page
with enough free space for a request to proceed quickly.
The global list will contain such information as
how many bytes are free on the page,
so that the list can be maintained without
touching the involved pages.
.P
There is a current hot allocation page,
on which is a current ``hot allocation block'',
from which blocks are obtained quickly
(by lopping off of its end)
from that page.
Otherwise, follow the free list of pages and
try from them.
If none of the free list pages can
support the request,
then obtain a new page.
.P
Free by coalescing a block with its neighbors if
they are free,
putting the block on its page's free list,
and updating the free byte count of the page.
.H 2 "Pros and Cons"
.BL
.LI
Fairly quick allocation \(em slightly slower than qalloc.
.LI
Excellent data locality for some objects,
such as PostScript proc arrays,
since successive allocs tend to come from the
same page.
.LI
Poor data locality for objects that need to be near
each other but are allocated over time,
such as canvases and events.
.LI
Similar overhead to qalloc \(em integer size.
Good use of available space on ``free pages''
Excellent page fragmentation integral to the design.
.LE
.H 2 "Hints"
.P
.B "Stop storing the size in the blocks."
.P
I believe that for 70% of the objects,
the function doing the qfree could tell it
exactly how big the object was
(e.g., feed it the known CONSTANT size).
For another 20%,
the size is stored somewhere in or around the
object and could be fed to qfree.
Another 5% could have their size calculated.
To store the size for
\f2every\f1
object just so that the remaining 5% can remain
clueless is wasteful.
.P
The prototype implementation of POM needs the
size of the object and its predecessor for
efficient coalescing.
.P
.B "Allocate hierarchical objects in arrays."
.P
Related objects that are always scanned with
hierarchies of similar objects could be allocated
several at a time so that the hierarchies will
exist on a few related ``hot'' pages.
This applies well to canvasses and events that
are constantly being scanned and compared/checked,
but not really to any other objects in our system.
The circumstances under which these objects are
scanned are also quite common
(mouse movement, keystrokes, etc.)
in areas greatly affecting perceived performance.
Use of separate HPA pools here would take care of this.
.P
.B "Special case keywords"
.P
Since keywords are not refcounted,
they can be allocated with little overhead
(they will never be put on a free list)
on their own pages.
Currently, they exist all throughout memory
(wherever allocated data was coming from when
they were encountered in an input stream),
and contain some unnecessary overhead.
.P
.B "Special case processes"
.P
Processes essentially contain one static
structure that points to three growable structures:\ \&
the operand, dictionary, and execution stacks.
The static process ``core'' also points to other
structures that are less well-controlled,
such as the gc stack.
The static process ``core'' is the only object in
the system that can point to the three growable structures.
Thus, if these structures grow,
only one pointer must be changed to relocate them.
Special functions could take care of allocating
and growing these ``children'' of the process to
keep them localized.
Giving them their own POM may be enough.
.P
.B "Analyze spaghetti trees and allocate them in one request"
.P
Take canvases, for instance:\ \&
all screen canvases have a RAS_TYPE_CHILD
of the screen raster,
and all orphan canvases have a RAS_TYPE_MEMORY raster.
This raster object is used every time rendering
is done to the canvas.
All rasters also have a shapes path object to
hold their clip path which is used in rendering,
clipping, and cursor movement.
Rather than allocate such objects separately and
have them point at each other,
allocate them together.
This is similar but simpler than the special
casing needed for processes,
since these structures don't grow.
This decreases the tightness of the
canvas hierarchy.
.P
.B "Implement debugging code above the memory management scheme"
.P
In the normal optimized server,
the memory management scheme should return blocks
exactly as requested
(possibly aligned to an n-byte boundary or
rounded up to aid in fragmentation).
In the debug server,
additional code should come into play
which intercepts the raw server requests,
adds a bit to the size,
passes the new request on to the base memory
management scheme,
receives the block back into which it puts the
debugging information,
and finally returns the block that follows
its debugging header to the server.
When blocks are freed,
the free request would be intercepted
(in the debugging server only),
and the debugging information checked before the
block is actually returned to the base memory
management routines.
An aid to this would be to eliminate the
allocation macros,
so that the replacement of one module in the
server could enable debugging or monitoring of
the memory allocation.
Debugging possibilities for the
current mechanism need work.
