Online release of Data-Oriented Design:
This is the free, online, reduced version. Some inessential chapters
are excluded from this version, but in the spirit of this being an
educational resource, the essentials are present for anyone wanting to
learn about data-oriented design.
Expect some odd formatting and some broken images and listings, as
this is auto-generated and the LaTeX-to-HTML converters available are
not perfect. If a source code listing is broken, you should be able
to find the referenced source on GitHub. If you like what you read
here, consider purchasing the real paper book from here, as not only
will it look a lot better, but it will help keep this version online
for those who cannot afford to buy it. Please send any feedback to
support@dataorienteddesign.com
Subsections
* It's all about the data
* Data is not the problem domain
* Data and statistics
* Data can change
* How is data formed?
* The framework
* Conclusions and takeaways
---------------------------------------------------------------------
Data-Oriented Design
Data-oriented design has been around for decades in one form or
another, but was only officially given a name by Noel Llopis in his
September 2009 article[NoelDOD] of the same name. Whether or not it is
a programming paradigm is contentious. Many believe it can be used
side by side with other programming paradigms such as object-oriented,
procedural, or functional programming. In one respect they are right:
data-oriented design can function alongside the other paradigms, but
that does not preclude it from being a way to approach programming in
the large. Other programming paradigms are known to function alongside
each other to some extent as well. A Lisp programmer knows that
functional programming can coexist with object-oriented programming,
and a C programmer is well aware that object-oriented programming can
coexist with procedural programming. We shall set these comments aside
and claim data-oriented design as another important tool; a tool just
as capable of coexistence as the rest.^1.1
The time was right in 2009. The hardware was ripe for a change in how
we develop. Potentially very fast computers were hindered by a
hardware-ignorant programming paradigm. The way game programmers coded
at the time made many engine programmers weep. The times have changed.
Many mobile and desktop solutions now seem to need the data-oriented
design approach less, not because the machines are better at
mitigating an ineffective approach, but because the games being
designed are less demanding and less complex. The trend for mobile
seems to be moving towards AAA development, which should bring back
the need for managing complexity and getting the most out of the
hardware.
As we now live in a world where multi-core machines include the ones
in our pockets, learning how to develop software in a less serial
manner is important. Moving away from objects messaging each other and
expecting immediate responses is one of the benefits available to the
data-oriented programmer. Programming with a firm awareness of the
data flow sets you up to take the next step to GPGPU and other compute
approaches, which leads to handling the workloads that bring game
titles to life. The need for data-oriented design will only grow. It
will grow because abstractions and serial thinking will be the
bottleneck of your competitors, and those who embrace the
data-oriented approach will thrive.
It's all about the data
Data is all we have. Data is what we need to transform in order to
create a user experience. Data is what we load when we open a
document. Data is the graphics on the screen, the pulses from the
buttons on your gamepad, the cause of your speakers producing waves
in the air, the method by which you level up and how the bad guy knew
where you were so as to shoot at you. Data is how long the dynamite
took to explode and how many rings you dropped when you fell on the
spikes. It is the current position and velocity of every particle in
the beautiful scene that ended the game which was loaded off the disc
and into your life via transformations by machinery driven by decoded
instructions themselves ordered by assemblers instructed by compilers
fed with source-code.
No application is anything without its data. Adobe Photoshop without
the images is nothing. It's nothing without the brushes, the layers,
the pen pressure. Microsoft Word is nothing without the characters,
the fonts, the page breaks. FL Studio is worthless without the
events. Visual Studio is nothing without source. All the applications
that have ever been written have been written to output data based
on some input data. The form of that data can be extremely complex,
or so simple it requires no documentation at all, but all
applications produce and need data. If they don't need recognisable
data, then they are toys or tech demos at best.
Instructions are data too. Instructions take up memory, use up
bandwidth, and can be transformed, loaded, saved and constructed.
It's natural for a developer to not think of instructions as being
data^1.2, but there is very little differentiating them on older,
less protective hardware. Even though memory set aside for
executables is protected from harm and modification on most
contemporary hardware, this relatively new invention is still merely
an invention, and the modified Harvard architecture relies on the
same memory for data as it does for instructions. Instructions are
therefore still data, and they are what we transform too. We take
instructions and turn them into actions. Their number, size, and
frequency matter. The idea that we have
control over which instructions we use to solve problems leads us to
optimisations. Applying our knowledge of what the data is allows us
to make decisions about how the data can be treated. Knowing the
outcome of instructions gives us the data to decide what instructions
are necessary, which are busywork, and which can be replaced with
equivalent but less costly alternatives.
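As a minimal sketch of that last point (not an example from the book):
if we know a fact about the data, such as a list of identifiers being
kept sorted, we can replace one set of instructions with an equivalent
but cheaper set.

#include <algorithm>
#include <vector>

// With no knowledge of the values, a linear scan is the safe choice.
bool contains_any_order(const std::vector<int>& values, int target) {
    return std::find(values.begin(), values.end(), target) != values.end();
}

// Knowing the data is kept sorted lets us swap those instructions for
// an equivalent but far cheaper alternative.
bool contains_sorted(const std::vector<int>& values, int target) {
    return std::binary_search(values.begin(), values.end(), target);
}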
This forms the basis of the argument for a data-oriented approach to
development, but leaves out one major element. All this data and the
transforming of data, from strings, to images, to instructions, they
all have to run on something. Sometimes that thing is quite abstract,
such as a virtual machine running on unknown hardware. Sometimes that
thing is concrete, such as knowing which specific CPU and GPU you
have, and the memory capacity and bandwidth you have available. But
in all cases, the data is not just data, but data that exists on some
hardware somewhere, and it has to be transformed by that same
hardware. In essence, data-oriented design is the practice of
designing software by developing transformations for well-formed data,
where the criteria for well-formed are guided by the target hardware
and the patterns and types of transforms that need to operate on it.
Sometimes the data isn't well defined, and sometimes the hardware is
equally evasive, but in most cases a good background of hardware
appreciation can help out almost every software project.
If the ultimate result of an application is data, and all input can
be represented by data, and it is recognised that all data transforms
are not performed in a vacuum, then a software development
methodology can be founded on these principles; the principles of
understanding the data, and how to transform it given some knowledge
of how a machine will do what it needs to do with data of this
quantity, frequency, and its statistical qualities. Given this basis,
we can build up a set of founding statements about what makes a
methodology data-oriented.
Data is not the problem domain
The first principle: Data is not the problem domain.
For some, it would seem that data-oriented design is the antithesis
of most other programming paradigms because data-oriented design is a
technique that does not readily allow the problem domain to enter
into the software as written in source. It does not promote the
concept of an object as a mapping to the context of the user in any
way, as data is intentionally and consistently without meaning.
Abstraction-heavy paradigms try to pretend the computer and its data
do not exist at every turn, abstracting away the idea that there are
bytes, or CPU pipelines, or other hardware features, and instead
bringing the model of the problem into the program. They regularly
bring either the model of the view into the code, or the model of the
world as a context for the problem. That is, they either structure
the code around attributes of the expected solution, or they
structure the code around the description of the problem domain.
Meaning can be applied to data to create information. Meaning is not
inherent in data. When you say 4, it means very little, but say 4
miles, or 4 eggs, it means something. When you have 3 numbers, they
mean very little as a tuple, but when you name them x,y,z, you can
put meaning on them as a position. When you have a list of positions
in a game, they mean very little without context. Object-oriented
design would likely have the positions as part of an object, and by
the class name and neighbouring data (also named) you can get an idea
of what that data means. Without the connected named contextualising
data, the positions could be interpreted in a number of different
ways, and though putting the numbers in context is good in some
sense, it also blocks thinking about the positions as just sets of
three numbers, which can be important for thinking of solutions to
the real problems the programmers are trying to solve.
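As a small hypothetical sketch of the difference (the names here are
made up, not taken from any real codebase): the same three numbers can
be locked inside an object, or kept as plain data any transform can
work with.

#include <vector>

// Object-oriented: the three numbers only have meaning alongside
// everything else in the class that contains them.
struct Enemy {
    // name, health, AI state, ...
    float x, y, z;
};

// Data-oriented: positions are just sets of three numbers.
struct Positions {
    std::vector<float> x, y, z;
};

// A transform over positions needs no knowledge of what they belong to.
float furthest_x(const Positions& p) {
    float best = 0.0f;
    for (float v : p.x) {
        if (v > best) best = v;
    }
    return best;
}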
For an example of what can happen when you put data so deep inside an
object that you forget its impact, consider the numerous games
released, and in production, where a 2D or 3D grid system could have
been used for the data layout, but for unknown reasons the developers
kept with the object paradigm for each entity on the map. This isn't
a singular event, and real shipping games have seen this
object-centric approach commit crimes against the hardware by having
hundreds of objects placed in WorldSpace at grid coordinates, rather
than actually being driven by a grid. It's possible that programmers
look at a grid, see the number of elements required to fulfil the
request, and are hesitant about the idea of allocating it in a single
lump of memory. Consider a simple 256 by 256 tilemap requiring 65,536
tiles. An object-oriented programmer may think of those sixty-five
thousand objects as being quite expensive. It might make more sense to
them to allocate the objects for the tiles only when necessary, even
to the point where there literally are sixty-five thousand tiles
created by hand in an editor; but because they were placed by hand,
their necessity has been established, and they are now something to
be handled, rather than something potentially worrying.
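For contrast, a minimal sketch of the grid-backed layout, assuming one
byte of tile data per cell (names are hypothetical):

#include <cstdint>
#include <vector>

constexpr int kMapWidth  = 256;
constexpr int kMapHeight = 256;

using TileId = std::uint8_t;   // one byte per cell is often plenty

// One allocation for the whole map: 65,536 bytes, contiguous, and
// trivially iterable.
std::vector<TileId> tiles(kMapWidth * kMapHeight, TileId{0});

TileId tile_at(int x, int y) {
    return tiles[y * kMapWidth + x];   // neighbours are one index, or one row, away
}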
Not only is this pervasive lack of an underlying form a poor way to
handle rendering and simple element placement, but it leads to much
higher complexity when interpreting locality of elements. Gaining
access to elements on a grid-free representation often requires
jumping through hoops such as having neighbour links (which need to
be kept up to date), running through the entire list of elements
(inherently costly), or references to an auxiliary augmented grid
object or spatial mapping system connecting to the objects which are
otherwise free to move, but won't, due to the design of the game.
This fake form of freedom introduced by the grid-free design presents
issues with understanding the data, has been the cause of significant
performance penalties in some titles, and causes a significant waste
of programmer mental resources in all of them.
Other than not having grids where they make sense, many modern games
also seem to carry an instance for each and every item in the game,
rather than a variable storing the number of items.
For some games this is an optimisation, as creation and destruction
of objects is a costly activity, but the trend is worrying, as these
ways of storing information about the world make the world
impenetrable to simple interrogation.
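As a tiny hypothetical illustration of a count rather than instances:

#include <cstdint>

// Rather than keeping a live object per collected coin ...
struct Coin { /* position, spin timer, mesh handle, ... */ };

// ... the world often only needs the number, which is trivial to
// interrogate.
struct PlayerWallet {
    std::uint32_t coins_collected = 0;
};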
Many games seem to try to keep everything about the player in the
player class. If the player dies in-game, they have to hang around as
a dead object, otherwise, they lose access to their achievement data.
This linking of what the data is, to where it resides and what it
shares lifetime with, causes monolithic classes and hard to untangle
relationships which frequently turn out to be the cause of bugs. I
will not name any of the games, but it's not just one title, nor just
one studio, but an epidemic of poor technical design that seems to
infect those who use off the shelf object-oriented engines more than
those who develop their own regardless of paradigm.
The data-oriented design approach doesn't build the real-world
problem into the code. This could be seen as a failing of the
data-oriented approach by veteran object-oriented developers, as the
celebrated successes of object-oriented design come from being able to
bring human concepts to the machine, so that in this middle ground a
solution can be written that is understandable by both human and
computer. The data-oriented approach gives up some of that human
readability by leaving the problem domain in the design document and
bringing only elements of constraints and expectations into the
transforms, but by that same action it stops the machine from having
to handle human concepts at any data level.
Let us consider how the problem domain becomes part of the software
in programming paradigms that promote needless abstraction. In the
case of objects, we tie meanings to data by associating them with
their containing classes and their associated functions. In
high-level abstraction, we separate actions and data by high-level
concepts, which might not apply at the low level, thus reducing the
likelihood the functions can be implemented efficiently.
When a class owns some data, it gives that data a context which can
sometimes limit the ability to reuse the data or understand the
impact of operations upon it. Adding functions to a context can bring
in further data, which quickly leads to classes containing many
different pieces of data that are unrelated in themselves, but need
to be in the same class because an operation required a context and
the context required more data for other reasons such as for other
related operations. This sounds awfully familiar, and Joe Armstrong
is quoted as saying ``I think the lack of reusability comes in
object-oriented languages, not functional languages. Because the
problem with object-oriented languages is they've got all this
implicit environment that they carry around with them. You wanted a
banana but what you got was a gorilla holding the banana and the
entire jungle."^1.3, which certainly resonates with the issue of
contextual referencing that plagues object-oriented languages.
You could be forgiven for believing that it's possible to remove the
connections between contexts by using interfaces or dependency
injection, but the connections lie deeper than that. The contexts in
the objects are often connecting different classes of data about
different categories in which the object fits. Consider how this
banana has many different purposes, from being a fruit, to being a
colour, to being a word beginning with the letter B. We have to
consider the problem presented by the idea of the banana as an
instance, as well as the banana being a class of entity too. If we
need to gain information about bananas from the point of view of the
law on imported goods, or about its nutritional value, it's going to
be different from information about how many we are currently
stocking. We were lucky to start with the banana. If we talk about
the gorilla, then we have information about the individual gorilla,
the gorillas in the zoo or jungle, and the class of gorilla too.
These are three different layers of abstraction for something to which
we might give one name. At least with a banana, each individual doesn't
have much in the way of important data. We see this kind of
contextual linkage all the time in the real world, and we manage the
complexity very well in conversation, but as soon as we start putting
these contexts down in hard terms we connect them together and make
them brittle.
All these mixed layers of abstraction become hard to untangle, as
functions which operate over each context drag in random pieces of
data from all over the classes, meaning many data items cannot be
removed as they would then become inaccessible. This can be enough to
stop most programmers from attempting large-scale evolving software
projects, but there is another issue caused by hiding the actions
applied to the data that leads to unnecessary complexity. When you
see lists and trees, arrays and maps, tables and rows, you can reason
about them and their interactions and transformations. If you attempt
to do the same with homes and offices, roads and commuters, coffee
shops and parks, you can often get stuck in thinking about the
problem domain concepts and not see the details that would provide
clues to a better data representation or a different algorithmic
approach.
There are very few computer science algorithms that cannot be reused
on primitive data types, but when you introduce new classes with their
own internal layouts of data, layouts that don't clearly follow the
patterns of existing data structures, you won't be able to fully
utilise those algorithms, and might not even be able to see how they
would apply. Putting data structures inside your object designs might
make sense from the point of view of what the objects are, but it
often makes little sense from the perspective of data manipulation.
When we consider the data from the data-oriented design point of
view, data is mere facts that can be interpreted in whatever way
necessary to get the output data in the format it needs to be. We
only care about what transforms we do, and where the data ends up. In
practice, when you discard meanings from data, you also reduce the
chance of tangling the facts with their contexts, and thus you also
reduce the likelihood of mixing unrelated data just for the sake of
an operation or two.
Data and statistics
The second principle: Data is the type, frequency, quantity, shape,
and probability.
The second statement is that data is not just the structure. A common
misconception about data-oriented design is that it's all about cache
misses. Even if it was all about making sure you never missed the
cache, and it was all about structuring your classes so the hot and
cold data was split apart, it would be a generally useful addition to
your programming toolkit, but data-oriented design is about all
aspects of the data. To write a book on how to avoid cache misses,
you need more than just some tips on how to organise your structures;
you need a grounding in what is really happening inside your computer
when it is running your program. Teaching that in a book is also
impossible, as it would only apply to one generation of hardware and
one generation of programming languages. However, data-oriented design
is not rooted in just one language and just some unusual hardware,
even though the language best placed to benefit from it is C++, and
the hardware that benefits from the approach the most is anything with
unbalanced bottlenecks. The schema of the data is important, but the
values and how the data is transformed are as important, if not more
so. It is not enough to have some photographs of a cheetah to
determine how fast it can run. You need to see it in the wild and
understand the true costs of being slow.
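For reference, a hypothetical sketch of the hot and cold split
mentioned above, with made-up names:

#include <string>
#include <vector>

// Hot data: read and written every frame by the movement transform,
// packed so each cache line fetched is full of useful values.
struct ParticleMotion {
    float x, y, z;
    float vx, vy, vz;
};

// Cold data: touched rarely (spawning, debugging, tools).
struct ParticleInfo {
    std::string debug_name;
    float       spawn_time;
};

struct Particles {
    std::vector<ParticleMotion> motion;   // iterated every frame
    std::vector<ParticleInfo>   info;     // parallel array, same index, rarely read
};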
The data-oriented design model is centred around data. It pivots on
live data, real data, data that is also information. Object-oriented
design is centred around the problem definition. Objects are not real
things but abstract representations of the context in which the
problem will be solved. The objects manipulate the data needed to
represent them without any consideration for the hardware or the
real-world data patterns or quantities. This is why object-oriented
design allows you to quickly build up first versions of applications,
allowing you to put the first version of the design document or
problem definition directly into the code, and make a quick attempt
at a solution.
Data-oriented design takes a different approach to the problem:
instead of assuming we know nothing about the hardware, it assumes we
know little about the true nature of our problem, and makes the
schema of the data a second-class citizen. Anyone who has written a
sizeable piece of software may recognise that the technical structure
and the design for a project often changes so much that there is
barely any section from the first draft remaining unchanged in the
final implementation. Data-oriented design avoids wasting resources
by never assuming the design needs to exist anywhere other than in a
document. It makes progress by providing a solution to the current
problem through some high-level code controlling sequences of events
and specifying schema in which to give temporary meaning to the data.
Data-oriented design takes its cues from the data which is seen or
expected. Instead of planning for all eventualities, or planning to
make things adaptable, there is a preference for using the most
probable input to direct the choice of algorithm. Instead of planning
to be extendable, it plans to be simple and replaceable, and get the
job done. Extensibility can be added later, with the safety net of
unit tests to ensure it keeps working as it did while it was simple.
Luckily, there is a way to make your data layout extendable without
requiring much thought, by utilising techniques developed many years
ago for working with databases.
Database technology took a great turn for the positive when the
relational model was introduced. The paper Out of the Tar Pit[TarPit]
takes it a step further with Functional Relational Programming, which
references the idea of using relational-model data structures with
functional transforms. These are well defined, and much literature is
available on how to adapt their form to match your requirements.
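As a rough sketch, with hypothetical names, of what a
relational-model-inspired layout can look like in code: rows are keyed
by a plain id, and a new kind of data is a new table rather than a
reshaping of every existing structure.

#include <cstdint>
#include <vector>

using EntityId = std::uint32_t;

struct PositionRow { EntityId id; float x, y, z; };
struct HealthRow   { EntityId id; int hp; };

struct World {
    std::vector<PositionRow> positions;   // only entities that have a position
    std::vector<HealthRow>   healths;     // only entities that can take damage
};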
Data can change
Data-oriented design is current. It is not a representation of the
history of a problem or a solution that has been brought up to date,
nor is it the future, with generic solutions made up to handle
whatever will come along. Holding onto the past will interfere with
flexibility, and looking to the future is generally fruitless as
programmers are not fortune tellers. It's the opinion of the author
that future-proof systems rarely are. Object-oriented design starts to
show its weaknesses when designs change in the real world.
Object-oriented design is known to handle changes to underlying
implementation details very well, as these are the expected changes,
the obvious changes, and the ones often cited in introductions to
object-oriented design. However, real-world changes, such as changes
in users' needs, or in input format, quantity, frequency, and the
route by which the information will travel, are not handled with
grace. It was pointed out in On the Criteria To Be Used in Decomposing
Systems into Modules[OnTheCriteria] that the modularisation
approach used by many at the time was rather like that of a
production line, where elements of the implementation are caught up
in the stages of the proposed solution. These stages themselves would
be identified with a current interpretation of the problem. In the
original document, the solution was to introduce a data hiding
approach to modularisation, and though it was an improvement, in the
later book Software Pioneers: Contributions to Software Engineering
[OnTheCriteriaLate], D. L. Parnas revisits the issue and reminds
us that even though initial software development can be faster when
making structural decisions based on business facts, it lays a burden
on maintenance and evolutionary development. Object-oriented design
approaches suffer from this inertia inherent in keeping the problem
domain coupled with the implementation. As mentioned, the problem
domain, when introduced into the implementation, can help with making
decisions quickly, as you can immediately see the impact the
implementation will have on getting closer to the goal of solving or
working with the problem in its current form. The problem with
object-oriented design lies in the inevitability of change at a
higher level.
Designs change for multiple reasons, occasionally including times
when they actually haven't. A misunderstanding of a design, or a
misinterpretation of a design, will cause as much change in the
implementation as a literal request for change of design. A
data-oriented approach to code design considers the change in design
through the lens of understanding the change in the meaning of the
data. The data-oriented approach to design also allows for change to
the code when the source of data changes, unlike the encapsulated
internal state manipulations of the object-oriented approach. In
general, data-oriented design handles change better as pieces of data
and transforms can be more simply coupled and decoupled than objects
can be mutated and reused.
The reason this is so comes from linking the intention, or the
aspect, to the data. When lumping data and functions in with concepts
of objects, you find the objects are the schema of the data. The
aspect of the data is linked to that object, which means it's hard to
think of the data from another point of view. The use case of the
data, and the real-world or design, are now linked to the data layout
through a singular vision implied by the object definition. If you
link your data layout to the union of the required data for your
expected manipulations, and your data manipulations are linked by
aspects of your data, then you make it hard to unlink data related by
aspect. The difficulty comes when different aspects need different
subsets of the data, and they overlap. When they overlap, they create
a larger and larger set of values that need to travel around the
system as one unit. It's common to refactor a class out into two or
more classes, or give ownership of data to a different class. This is
what is meant by tying data to an aspect. It is tied to the lens
through which the data has purpose, but with statically typed objects
that purpose is predefined, a union of multiple purposes, and it
sometimes carries around defunct relationships. Some purposes may no
longer be required by the design. Unfortunately, it's easier to see
when a relationship needs to exist than when it doesn't, and that leads
to more connections, not fewer, over time.
If you link your operations by related data, such as when you put
methods on a class, you make it hard to unlink your operations when
the data changes or splits, and you make it hard to split data when
an operation requires the data to be together for its own purposes.
If you keep your data in one place, operations in another place, and
let the aspects and roles of the data emerge from how the operations
and transforms are applied to it, then you will find that many
times when refactoring would have been large and difficult in
object-oriented code, the task now becomes trivial or non-existent.
With this benefit comes a cost of keeping tabs on what data is
required for each operation, and the potential danger of
de-synchronisation. This consideration can lead to keeping some cold
code in an object-oriented style where objects are responsible for
maintaining internal consistency over efficiency and mutability.
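A minimal sketch of data in one place and operations in another (names
are hypothetical): the transform states exactly which data it needs and
nothing more, so if the layout later splits or changes, only the
transforms that actually read it need revisiting.

#include <cstddef>
#include <vector>

struct Positions  { std::vector<float> x, y, z; };
struct Velocities { std::vector<float> x, y, z; };

// A free function over plain data, rather than a method on an object.
void integrate(Positions& p, const Velocities& v, float dt) {
    for (std::size_t i = 0; i < p.x.size(); ++i) {
        p.x[i] += v.x[i] * dt;
        p.y[i] += v.y[i] * dt;
        p.z[i] += v.z[i] * dt;
    }
}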
An example of a place where object-oriented design is far superior to
data-oriented design is that of driver layers for systems or hardware.
Even though Vulkan and OpenGL are object-oriented, the granularity of
the objects is large and linked to stable concepts in their space,
just like the object-oriented approach of the FILE type or handle used
in the open, close, read, and write operations of filesystems.
A big misunderstanding for many who are new to the data-oriented
design paradigm, a notion brought over from abstraction-based
development, is that we can design a static library or set of
templates to provide
generic solutions to everything presented in this book as a
data-oriented solution. Much like with domain driven design,
data-oriented design is product and work-flow specific. You learn how
to do data-oriented design, not how to add it to your project. The
fundamental truth is that data, though it can be generic by type, is
not generic in how it is used. The values are different and often
contain patterns we can turn to our advantage. The idea that data can
be generic is a false claim that data-oriented design attempts to
rectify. The transforms applied to data can be generic to some
extent, but the order and selection of operations are literally the
solution to the problem. Source code is the recipe for conversion of
data from one form into another. There cannot be a library of
templates for understanding and leveraging patterns in the data, and
that's what drives a successful data-oriented design. It's true we
can build algorithms to find patterns in data (otherwise, how would
compression be possible?), but the patterns we think about when it
comes to data-oriented design are higher level, domain-specific, and
not simple frequency mappings.
Our run-time benefits from specialisation through performance tricks
that sometimes make the code harder to read, but such specialisation
is frequently discouraged as being not object-oriented, or too
hard-coded. It can be better to hard-code a transform than to pretend
it's not hard-coded by wrapping it in a generic container and using
less direct algorithms on it. Using existing templates like this
provides the benefit of increased readability for those who already
know the library, and potentially fewer bugs if the functionality
really was generic. But if the functionality was not well mapped to
the existing generic solution, writing it with a function template and
then extending it will make the code harder to understand, and hiding
the fact that the technique has been subtly changed will introduce
false assumptions. Hard-coding a new algorithm is a better choice as
long as it has sufficient tests, and is objectively new. Tests will
also be easier to write if you constrain yourself to the facts about
concrete data and only test with real, but simple data for your
problem, and not generic types on generic data.
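A small hypothetical example of that testing style, with a made-up
rule and made-up values:

#include <cassert>

// Hard-coded for this game's rule: positive armour halves damage,
// rounding down.
int apply_armour(int damage, int armour) {
    return armour > 0 ? damage / 2 : damage;
}

void test_apply_armour() {
    // Real but simple values taken from the design, not generic fixtures.
    assert(apply_armour(10, 0) == 10);
    assert(apply_armour(10, 1) == 5);
    assert(apply_armour(9, 3) == 4);
}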
How is data formed?
The games we write have a lot of data, in a lot of different formats.
We have textures in multiple formats for multiple platforms. There
are animations, usually optimised for different skeletons or types of
playback. There are sounds, lights, and scripts. Don't forget meshes;
they consist of multiple buffers of attributes. Only a very small
proportion of meshes are of the old fixed-function type, with vertices
containing positions, UVs, and normals. The data in game development
is hard to box, and getting harder to pin down as more ideas which
were previously considered impossible have now become commonplace.
This is why we spend a lot of time working on editors and
tool-chains, so we can take the free-form output from designers and
artists and find a way to put it into our engines. Without our
tool-chains, editors, viewers, and tweaking tools, there would be no
way we could produce a game with the time we have. The
object-oriented approach provides a good way to wrap our heads around
all these different formats of data. It gives a centralised view of
where each type of data belongs and classifies it by what can be done
to it. This makes it very easy to add and use data quickly, but
implementing all these different wrapper objects takes time. Adding
new functionality to these objects can sometimes require large
amounts of refactoring as occasionally objects are classified in such
a way that they don't allow for new features to exist. For example,
in many old engines, textures were always 1, 2, or 4 bytes per pixel.
With the advent of floating-point textures, all that code required a
minor refactoring. In the past, it was not possible to read a texture
from the vertex shader, so when texture based skinning came along,
many engine programmers had to refactor their render update. They had
to allow for a vertex shader texture upload because it might be
necessary when uploading transforms for rendering a skinned mesh.
When the PlayStation 2 came along, or when an engine first used
shaders, the very idea of what made a material had to change. The move
from small 3D environments to large open worlds with level of detail
caused many engineers to start thinking about what it meant for
something to need rendering. When newer hardware became more picky
about alignment, other hard to inject changes had to be made. In many
engines, mesh data is optimised for rendering, but when you have to
do mesh ray casting to see where bullets have hit, or for doing IK,
or physics, then you need multiple representations of an entity. At
this point, the object-oriented approach starts to look cobbled
together as there are fewer objects that represent real things, and
more objects used as containers so programmers can think in larger
building blocks. These blocks hinder though, as they become the only
blocks used in thought, and stop potential mental connections from
happening. We went from 2D sprites to 3D meshes, following the format
of the hardware provider, to custom data streams and compute units
turning the streams into rendered triangles. Wave data, to banks, to
envelope controlled grain tables and slews of layered sounds.
Tilemaps, to portals and rooms, to streamed, multiple levels of
detail chunks of world, to hybrid mesh palette, props, and unique
stitching assets. From flip-book to Euler angle sequences, to
quaternions and spherical interpolated animations, to animation trees
and behaviour mapping/trees. Change is the only constant.
All these types of data are pretty common if you've worked in games
at all, and many engines do provide an abstraction to these more
fundamental types. When a new type of data becomes heavily used it is
promoted into engines as a core type. We normally consider the
trade-off of new types being handled as special cases until they
become ubiquitous to be one of usability versus performance. We don't
want to provide free access to the lesser-understood elements of game
development. People who do not, or cannot, invest time in finding out
how best to use new features are discouraged from using them. The
object-oriented game development way to do that is to not provide
objects which represent them, and instead only offer the features to
people who know how to utilise the more advanced tools.
Apart from the objects representing digital assets, there are also
objects for internal game logic. For every game, there are objects
which only exist to further the game-play. Collectable card games
have a lot of textures, but they also have a great deal of rules,
card stats, player decks, match records, with many objects to
represent the current state of play. All of these objects are
completely custom designed for one game. There may be sequels, but
unless it's primarily a re-skin, it will use quite different game
logic in many places, and therefore require different data, which
would imply different methods on the now guaranteed to be internally
different objects.
Game data is complex. Any first layout of the data is inspired by the
game's initial design. Once development is underway, the layout needs
to keep up with whichever way the game evolves. Object-oriented
techniques offer a quick way to implement each singular design in
turn, but don't offer a clean or graceful way to migrate from one data
schema to the next.
There are hacks, such as those used in version based asset handlers,
or in frameworks backed by update systems and conversion scripts, but
normally, game developers change the tool-chain and the engine at the
same time, do a full re-export of all the assets, then commit to the
next version all in one go. This can be quite a painful experience if
it has to happen over multiple sites at the same time, or if you have
a lot of assets, or if you are trying to provide engine support for
more than one title, and only one wants to change to the new
revision. An example of an object-oriented approach that handles
migration of design with some grace is the Django framework, but the
reason it handles the migration well is that the objects would appear
to be views into data models, not the data itself.
There have not yet been any successful efforts to build a generic
game asset solution. This may be because all games differ in so many
subtle ways that if you did provide a generic solution, it wouldn't
be a game solution, just a new language. There is no solution to be
found in trying to provide all the possible types of object a game
can use. But, there is a solution if we go back to thinking about a
game as merely running a set of computations on some data. The
closest we can get in 2018 is the FBX format, with some dependence on
the current standard shader languages. The current solutions appear
to have excess baggage which does not seem easy to remove. Due to the
need to be generic, many details are lost through abstractions and
strategies to present data in a non-confrontational way.
What can provide a computational framework for such complex data?
Game developers are notorious for thinking about game development
from either a low-level, all-out performance perspective or from a
very high-level gameplay and interaction perspective. This may have
come about because of the widening gap between the amount of code
that has to be high performance, and the amount of code to make the
game complete. Object-oriented techniques provide good coverage of
the high-level aspect, so the high-level programmers are content with
their tools. The performance specialists have been finding ways of
doing more with the hardware, so much so that a lot of the time
content creators think they don't have a part in the optimisation
process. There has never been much of a middle ground in game
development, which is probably the primary reason why the structure
and performance techniques employed by big-iron companies didn't seem
useful. The secondary reason could be that game developers don't
normally develop systems and applications which have decade-long
maintenance expectations^1.4 and therefore are less likely to be
concerned about why their code should be encapsulated and protected
or at least well documented. When game development was first
flourishing into larger studios in the late 1990s, academic or
corporate software engineering practices were seen as suspicious
because wherever they were employed, there was a dramatic drop in
game performance, and whenever any prospective employees came from
those industries, they failed to impress. As games machines became
more like standard micro-computers, and standard micro-computers drew
closer in design to the mainframes of old, it became more apparent
that some of those standard professional software engineering
practices could be useful. Now the scale of games has grown to match
the hardware, but the games industry has stopped looking at where
those non-game development practices led. As an industry, we should
be looking to where others have gone before us, and the closest set
of academic and professional development techniques seem to be
grounded in simulation and high volume data analysis. We still have
industry-specific challenges such as the problems of high frequency
highly heterogeneous transformational requirements that we experience
in sufficiently voluminous AI environments, and we have the issue of
user proximity in networked environments, such as the problems faced
by MMOs when they have location-based events, and bandwidth starts to
hit n^2 issues as everyone is trying to message everyone else.
With each successive generation, the number of developer hours to
create a game has grown, which is why project management and software
engineering practices have become standardised at the larger games
companies. There was a time when game developers were seen as
cutting-edge programmers, inventing new technology as the need
arises, but with the advent of less adventurous hardware (most
notably in the x86-based recent 8th generation), there has been a
shift away from ingenious coding practices, and towards a
standardised process. This means game development can be tuned to
ensure the release date will coincide with marketing dates. There
will always be an element of randomness in high profile game
development. There will always be an element of innovation that
virtually guarantees you will not be able to predict how long the
project, or at least one part of the project, will take. Even if
data-oriented design isn't needed to make your game go faster, it can
be used to make your game development schedule more regular.
Part of the difficulty in adding new and innovative features to a
game is the data layout. If you need to change the data layout for a
game, it will need objects to be redesigned or extended in order to
work within the existing framework. If there is no new data, then a
feature might require that previously separate systems suddenly be
able to talk to each other quite intimately. This coupling can often
cause system-wide confusion with additional temporal coupling and
corner cases so obscure they can only be reproduced one time in a
million. These odds might sound fine to some developers, but if
you're expecting to sell five to fifty million copies of your game,
at one in a million, that's five to fifty people who will experience
the problem, can take a video of your game behaving oddly, post it on
the YouTube, and call your company rubbish, or your developers lazy,
because they hadn't fixed an obvious bug. Worse, what if the
one-in-a-million issue was a way to circumvent in-app purchases, was
reproducible if you knew what to do, and the steps started spreading
on Twitter, or it created an economy-destroying influx of resources
in a live MMO universe^1.5. In the past, if you had sold five to
fifty million copies of your game, you wouldn't care, but with the
advent of free-to-play games, five million players might be
considered a good start, and poor reviews coming in will curb the
growth. IAP circumventions will kill your income, and economy
destruction will end you.
Big iron developers had these same concerns back in the 1970's. Their
software had to be built to high standards because their programs
would frequently be working on data concerned with real money
transactions. They needed to write business logic that operated on
the data, but most important of all, they had to make sure the data
was updated through a provably careful set of operations in order to
maintain its integrity. Database technology grew from the need to
process stored data, to do complex analysis on it, to store and
update it, and be able to guarantee it was valid at all times. To do
this, the ACID test was used to ensure atomicity, consistency,
isolation, and durability. Atomicity was the test to ensure all
transactions would either complete or do nothing. It could be very
bad for a database to update only one account in a financial
transaction. There could be money lost or created if a transaction
was not atomic. Consistency was added to ensure all the resultant
state changes which should happen during a transaction do happen,
that is, all triggers which should fire, do fire, even if the
triggers cause triggers recursively, with no limit. This would be
highly important if an account should be blocked after it has
triggered a form of fraud detection. If a trigger has not fired, then
the company using the database could risk being liable for even more
than if they had stopped the account when they first detected fraud.
Isolation is concerned with ensuring all transactions which occur
cannot cause any other transactions to differ in behaviour. Normally
this means that if two transactions appear to work on the same data,
they have to queue up and not try to operate at the same time.
Although this is generally good, it does cause concurrency problems.
Finally, durability. This was the second most important element of
the four, as it has always been important to ensure that once a
transaction has completed, it remains so. In database terminology,
durability meant the transaction would be guaranteed to have been
stored in such a way that it would survive server crashes or power
outages. This was important for networked computers where it would be
important to know what transactions had definitely happened when a
server crashed or a connection dropped.
Modern networked games also have to worry about highly important data
like this. With non-free downloadable content, consumers care about
consistency. With consumable downloadable content, users care a great
deal about every transaction. To provide much of the functionality
required of the database ACID test, game developers have gone back to
looking at how databases were designed to cope with these strict
requirements and found reference to staged commits, idempotent
functions, techniques for concurrent development, and a vast
literature base on how to design tables for a database.
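As a sketch of the idempotent-function idea, with entirely hypothetical
names: the transform is written so that applying the same transaction
twice has no further effect.

#include <cstdint>
#include <set>
#include <string>

struct Account {
    std::int64_t          gems = 0;
    std::set<std::string> applied;   // persisted alongside the balance
};

// Replaying the same purchase record, for example after a crash and
// recovery, leaves the account exactly as if it had been applied once.
void apply_purchase(Account& account, const std::string& transaction_id,
                    std::int64_t gems_bought) {
    if (account.applied.count(transaction_id) != 0) {
        return;                      // already applied: do nothing
    }
    account.gems += gems_bought;
    account.applied.insert(transaction_id);
}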
Conclusions and takeaways
We've talked about data-oriented design being a way to think about
and lay out your data and to make decisions about your architecture.
We have two principles that can drive many of the decisions we need
to make when doing data-oriented design. To finish the chapter, there
are some takeaways you can use immediately to begin your journey.
Consider how your data is being influenced by what it's called.
Consider the possibility that the proximity of other data can
influence the meaning of your data, and in doing so, trap it in a
model that inhibits flexibility. For the consideration of the first
principle, data is not the problem domain, it's worth thinking about
the following items.
* What is tying your data together: is it a concept or an implied
meaning?
* Is your data layout defined by a single interpretation from a
single point of view?
* Think about how the data could be reinterpreted and cut along
those lines.
* What is it about the data that makes it uniquely important?
You are not targeting an unknown device with unknowable
characteristics. Know your data, and know your target hardware. To
some extent, understand how much each stream of data matters, and who
is consuming it. Understand the cost and potential value of
improvements. Access patterns matter, as you cannot hit the cache if
you're accessing things in a burst, then not touching them again for
a whole cycle of the application. For the consideration of the second
principle, data is the type, frequency, quantity, shape, and
probability, it's worth thinking about the following items.
* What is the smallest unit of memory on your target platform?^1.6
* When you read data, how much of it are you using?
* How often do you need the data? Is it once, or a thousand times a
frame?
* How do you access the data? At random, or in a burst?
* Are you always modifying the data, or just reading it? Are you
modifying all of it?
* Who does the data matter to, and what about it matters?
* Find out the quality constraints of your solutions, in terms of
bandwidth and latency.
* What information do you have that isn't in the data per-se? What
is implicit?
---------------------------------------------------------------------
Richard Fabian 2018-10-08