Mapping (almost) every law, regulation and case in Australia
Mar 21, 2024 · Data, Law

What if you could take every law, regulation and case in Australia and project them onto a two-dimensional map such that their distance from one another was proportional to their similarity in meaning? What would that look like? Perhaps something like this.

[Image: The first ever map of Australian law.]

This is the first ever map of Australian law. Each point represents a unique law, regulation or case in the Open Australian Legal Corpus, the world's largest open source database of Australian law (you can learn about how I built that corpus here). The closer any two documents are on the map, the more similar they are in meaning.

If you're on a computer and hover over a document, you'll see its title, type, jurisdiction and category. You can open a document by clicking on it. Documents are coloured by category. The legend on the right shows what colour each category corresponds to. Click on a category and you'll exclude it from the map. Double click and you'll only see documents from that category.

Over the course of this article, I'll cover exactly what the map can teach us about Australian law as well as give you a behind-the-scenes look at how I built it, providing code examples along the way. Those more interested in the technology powering the map can skip straight to that section here.

What can we learn from it?

While it might not look like much at first, the map gives us a rare look into some of the many hidden ways Australian laws, regulations and cases are both connected to and disconnected from one another.

The invisible barrier between cases and legislation

It is readily observable, for example, that there is a sort of invisible barrier separating cases on the one hand from legislation on the other. This barrier corresponds roughly with the map's north and south poles.

[Image: An annotated version of the map where cases and legislation are enclosed in two shapes corresponding with the map's north and south poles, respectively.]

The presence of this barrier tells us that documents of the same type will tend to share more in common with each other than they will with documents of the same subject matter. Although they may often focus on the same topics, cases and legislation are, after all, written in different styles, towards different ends.

The absence of borders between documents of different jurisdictions

Interestingly, however, we find no such borders between documents of different jurisdictions, although it is worth noting that, due to copyright restrictions, the Open Australian Legal Corpus only contains decisions from the Commonwealth and New South Wales and is missing legislation from Victoria, the Northern Territory and the Australian Capital Territory.

[Image: An alternate view of the map where documents are coloured by jurisdiction, illustrating the lack of boundaries between documents of different jurisdictions.]

The absence of borders between cases and legislation of different jurisdictions indicates that Australian state and federal law is relatively homogenous.
No differences between the style, principles of interpretation or general jurisprudence of state and federal law appear to be reflected in the map. What borders do exist between state and federal law correspond better with differences in subject matter than with the jurisprudence of their jurisdictions. This conforms with the fact that state and federal courts and legislatures operate within a single legal framework, under which they have jurisdiction over matters prescribed by the Constitution in their territory, with a single court, the High Court of Australia, arbitrating on disputes between governments over the precise limits of those constitutional rights and powers.

The judicial and legislative mainlands and islands

Turning back to the barrier between cases and legislation, we also observe that, within the map's north and south poles, each pole has a 'mainland' of sorts that most documents belong to, and then there is a range of 'islands' that orbit those mainlands, typically consisting of documents of the same subject matter.

[Image: An annotated version of the map where 'islands' of documents of the same subject matter are enclosed in shapes that orbit 'mainlands' of cases and legislation.]

The fact that there are judicial and legislative mainlands suggests that most cases and legislation draw from and feed into a single, interconnected pool of knowledge. This is not particularly surprising. What is surprising is that there are large islands of legislation and judgments that are entirely cut off from their respective mainlands.

Tariff concession orders, for example, form their very own unique archipelago, perhaps because each order is centred around regulating a distinct, often quite technical class of importable goods, from magazine holders to forklifts. There is also quite a sizeable island of airworthiness directives primarily focused on regulating aircraft components, another highly technical domain.

Somewhat unexpectedly, the largest island by surface area consists almost entirely of migration cases. Furthermore, of all 19 possible branches of law, migration and family law are the only two to be found more often outside a mainland than inside one. Migration and family law are, in effect, the most isolated areas of Australian law on the map.

Funnily enough, while researching why that might be, I stumbled upon this rather pertinent quote from Lord Sumption:

Courts exercising family jurisdiction do not occupy a desert island in which general legal concepts are suspended or mean something different. If a right of property exists, it exists in every division of the High Court and in every jurisdiction of the county courts. If it does not exist, it does not exist anywhere.

Prest v Petrodel Resources Ltd [2013] UKSC 34, [37] (emphasis added)

I also discovered that Munby LJ, later President of the UK Family Division, had likewise once quipped:

The Family Division is part of the High Court. It is not some legal Alsatia where the common law and equity do not apply. The rules of agency apply there as much as elsewhere. But in applying those rules one must have regard to the context ...

Richardson v Richardson [2011] EWCA Civ 79, [53]

It would seem that there was already a perception that family law is somewhat isolated from the rest of the law, which the map appears to support.
As for migration law, although I was unable to locate any equally apropos quotes, from my own review of a selection of cases on the map, migration cases appear relatively self-contained in that they tend to reference legislation and cases particular to migration law. It also makes sense that migration law would be a little distant from other areas of law given its unique subject matter.

While not as insular as family and migration law, the relatively large hexagram-shaped island of criminal law, which features a tail of transport and administrative law cases coming out of it, is also worth addressing. That island appears to consist mostly of substantive criminal law cases (along with certain punitive transport and administrative law cases focused on the suspension of various types of licences), whereas the criminal law cases connected to the judicial mainland tend to concern criminal procedure.

[Image: An annotated version of criminal law cases with the island of substantive criminal law cases and the cluster of procedural criminal law cases connected to the judicial mainland enclosed in their own shapes. Only light blue data points are criminal cases.]

This supports the broad division of substantive law into criminal law and civil law while also conforming well with the fact that criminal procedure law and civil procedure law share a number of common principles of natural justice.

The most and least legislative areas of judicial law

Fascinatingly, migration, family and substantive criminal law also all tend to cluster closely together latitudinally, hinting at a potential hidden connection. They are all known to overlap in certain ways and they all share a special focus on regulating the lives of individuals, and not merely the property rights of legal persons.

Migration, family and substantive criminal law cases also all happen to be the most distant types of cases from legislation on the map. Of course, this does not mean that they never cite legislation, but it may be that they rely on precedent more often than other areas of case law. It might also be the result of the inherent difficulty in attempting to represent highly complex and multidimensional relationships in a simple two-dimensional map.

Conversely, the class of cases closest to legislation is development cases, which makes sense since they can often deal quite intimately with local planning laws and regulations.

The case law continuum

If we start at the bottom of the cases mainland and make our way up, we can also see that Australian case law is a continuum of sorts.

[Image: An annotated version of the case law mainland where select branches of law are pointed out, illustrating the continuum of case law.]

Development cases connect with environmental cases, which then link with land cases. Land cases border contract cases, which in turn have procedural cases to their north, intellectual property cases to their west and commercial cases to their east. Moving further north of procedural law brings you to criminal law and defamation. Heading west from intellectual property law takes you through administrative law, health and social services law, employment law, negligence and finally transport law. Going east of commercial law, you'll find equity and a subset of family law.

[Image: An animation of branches of law appearing on the map sequentially, illustrating the continuum of case law.]
This continuum corresponds well with our pre-existing understandings of the relationships between the various branches of the law. It makes sense, for example, that development, environment and land law would all be intertwined given their similar subject matter. Likewise, it is not at all unexpected that negligence would cluster closely with transport and employment law given that a great many negligence cases centre around motor and workplace accident claims. The map, in effect, crystallises our own mental models of the law.

It also shows us that the borders between various areas of the law can often be quite porous. We notice, for instance, that there is a streak of land law judgments that overlaps with commercial and procedural law cases and is disconnected from most other land law cases. Interestingly, cases in this streak tend to focus on mortgage disputes, often involving defaults, which would explain why they overlap with commercial and procedural law cases.

We can also see that there are some transport law judgments that are connected to the cases mainland and others that are connected to the island of substantive criminal law cases. Transport judgments connected to that island often centre around the suspension of transport licences, whereas judgments connected to the cases mainland tend to focus on transport accidents. Although disconnected from one another, both clusters of transport cases are still relatively close to each other, reflecting their shared subject matter.

Final thoughts

By now, we've covered how the map reflects already known distinctions between cases and legislation, while also revealing potential new divisions and hidden connections between various areas of the law. We've also seen how Australian case law can be more of a continuum than a rigidly defined structure and how the borders between branches of case law can often be quite porous. Other specific insights we've been able to find are that:

* Migration, family and substantive criminal law are the most isolated branches of case law on the map;
* Migration, family and substantive criminal law are the most distant branches of case law from legislation on the map;
* Development law is the closest branch of case law to legislation on the map; and
* The map does not reveal any noticeable distinctions between Australian state and federal law, whether it be in style, principles of interpretation or general jurisprudence.

These are but a selection of the most readily observable insights to be gained from attempting to map Australian law. There are no doubt countless others waiting to be uncovered. Producing a three-dimensional map of Australian laws, cases and regulations could, for example, reveal new hidden relationships that are almost impossible to represent in two dimensions. Adding cases and legislation from other states and territories might also give us a sharper, higher resolution image of the map, deepening our understanding of the geography of Australian law. One could even imagine adding legal documents from other common law countries such as the UK, Canada and New Zealand to, in a sense, photograph the historic and continued interactions between our legal systems.

Nevertheless, for a first attempt, the map already has a lot to teach us. Perhaps you've even identified patterns in the map that I could not. The greatest thing about this exercise is that it can be applied to virtually any domain, not just Australian law.
Semantic mapping is particularly useful for very quickly developing an understanding of the underlying composition and structure of a dataset without having to manually scour through hundreds of examples to build your own much noisier and less persistent mental model of that data. Since finishing the map, I've already been able to reuse this technique to study countless other seemingly unstructured large datasets, and you can too. It doesn't take an expert in clustering and mapping to pull it off. Far from it. Prior to starting this project, I didn't know the first thing about semantic mapping and now I'm about to teach you how to do it yourself.

So how'd you do it?

At a high level, the process for mapping any arbitrary set of data points, whether they be PDFs, YouTube videos, TikToks or anything else, can be broken down into six stages, illustrated below.

[Image: An illustration of the process of semantically mapping data.]

In brief, we try to represent the meaning of data in the form of sets of numbers (vectorisation), after which we group those sets into clusters based on their similarity (clustering) and subsequently label those clusters based on whatever unique patterns we can find in them (labelling). Finally, we project the numerical representations of the data into two-dimensional coordinates (dimensionality reduction), which we then plot on a map (visualisation).

Throughout this next section, we'll take a deeper look at exactly how every step of the semantic mapping process works in practice. Before that though, I'd like to express my gratitude to the creators of BERTopic, a topic modelling technique which this process was loosely based on, as well as Dr Mike DeLong, whose topic map of the Open Australian Legal Corpus served as the inspiration for this entire project.

Vectorisation

The first step in semantically mapping a dataset is to vectorise its data points. In this context, vectorisation refers to the process of converting information into a set of numbers intended to represent its underlying meaning, known as a vector or embedding. By calculating how similar vectors are to one another, we can also get a rough idea of how similar they are in meaning. This principle is what allows us to later group data points into clusters and project them onto a two-dimensional map.

To vectorise a data point, we can use an embedding model, a model specially trained for the task of representing the meaning of information as vectors. Thankfully, for my uses and probably yours too, it isn't necessary to train a custom embedding model or pay someone to use theirs. At least for text vectorisation, most of the world's best models are already available for free and under open-source licences. Hugging Face helpfully maintains a ranked list of the most accurate text embedding models as benchmarked against hundreds of datasets, known as the Massive Text Embedding Benchmark (MTEB) Leaderboard. When I built the map, BAAI/bge-small-en-v1.5 was one of the best open-source models available for its size, so that's what I went with. Nowadays, avsolatorio/GIST-small-Embedding-v0 (a finetune of that model) ranks higher, but it's worth checking out the leaderboard yourself as new models are released every day.

One constraint of contemporary text embedding models worth keeping in mind is that they can only vectorise a fixed number of tokens, known as a context window. If you don't know what a token is, you can think of it as the most basic unit of input a text model can take. A token is roughly three-quarters of a word, so if a text embedding model's context window is 512 tokens, like GIST-small-Embedding-v0's is, then you can only vectorise roughly 384 words at a time.
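To get a feel for tokens and context windows, you can ask a model's tokeniser how many tokens a text takes up and how many tokens the model will accept at most. This is a quick illustration I've added rather than part of the original pipeline, and it assumes you have transformers installed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5')

text = 'The quick brown fox jumps over the lazy dog.'
tokens = tokenizer.encode(text, add_special_tokens=False)

print(len(tokens))  # The number of tokens the model would see for this text.
print(tokenizer.model_max_length)  # The model's context window (512 for this model).
```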
To get around this, we can split text into chunks up to 512 tokens long, vectorise each chunk and then average those vectors to produce a mean text embedding that represents the average meaning of the text. This process can also be applied to vectorise videos and audio clips longer than what an embedding model can take as input, or really any other type of embeddable data.

[Image: An illustration of the process of producing a mean embedding.]

In splitting up our long-form data, however, it is essential that we do so in as meaningful a way as possible. Simply breaking up text at every 512th token or, if we're working with audiovisual data, every 512th second, could result in the loss of semantically important information. Imagine if we ended up splitting the sentence 'I love to eat kangaroo gummies, they're my favourite snack' at the word 'kangaroo', resulting in the chunks 'I love to eat kangaroo' and 'gummies, they're my favourite snack'. The final embedding would no doubt be quite dissimilar from the text's actual meaning.

Ideally, we'd like for our data to have already been divided into semantically meaningful sections that are all under our model's context window. Realistically though, our data may not have any sections at all or, if it does have sections, not all may fit within the model's context window. In such cases, we can first split our data into whatever parts we do have and then use a semantic chunker to bring whatever data is over the context window, under it.

For text data, I'd recommend semchunk, an extremely fast and lightweight Python library I developed to split millions of tokens' worth of text into chunks as semantically meaningful as possible in a matter of seconds. It works by searching for sequences of characters known to indicate semantic breaks in text, such as consecutive newlines and tabs, and then recursively splitting at those sequences until a text is under the given chunk size.

The code below demonstrates how you can use semchunk to split a dataset of documents into chunks with any given Hugging Face text embedding model of your choice. Just make sure you have semchunk and transformers installed beforehand. After chunking our data, we still need to vectorise it and then average those vectors such that [1, 2] and [3, 4] become [2, 3] (not [7]). This is how you'd do that in practice, keeping in mind this code also requires torch and tqdm.
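What follows are minimal sketches of those two steps rather than the exact code behind the map: the model, the toy documents and the mean-pooling strategy are my own illustrative assumptions that you may want to adapt (bge-style models, for instance, are often used with CLS pooling instead). First, chunking:

```python
import semchunk
from tqdm import tqdm
from transformers import AutoTokenizer

MODEL = 'BAAI/bge-small-en-v1.5'  # Any Hugging Face text embedding model will do.
CHUNK_SIZE = 512  # The model's context window, in tokens.

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# A toy dataset standing in for your own documents.
documents = [
    'An Act relating to the importation of goods into Australia...',
    'The applicant seeks review of a decision of the Tribunal...',
]

# Count tokens without special tokens so that chunks fit within the context window.
token_counter = lambda text: len(tokenizer.encode(text, add_special_tokens=False))

# Split every document into semantically meaningful chunks of at most 512 tokens.
chunked_documents = [
    semchunk.chunk(document, chunk_size=CHUNK_SIZE, token_counter=token_counter)
    for document in tqdm(documents, desc='Chunking')
]
```

And then vectorising the chunks and averaging them into a single mean embedding per document:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(MODEL).eval()

def embed(chunks: list[str]) -> torch.Tensor:
    """Vectorise a document's chunks and average them into a mean embedding."""

    inputs = tokenizer(chunks, padding=True, truncation=True,
                       max_length=CHUNK_SIZE, return_tensors='pt')

    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool each chunk's token embeddings, ignoring padding tokens.
    mask = inputs['attention_mask'].unsqueeze(-1)
    chunk_embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

    # Average the chunk embeddings into a single vector representing the document.
    return chunk_embeddings.mean(dim=0)

embeddings = torch.stack([embed(chunks) for chunks in tqdm(chunked_documents, desc='Vectorising')])
print(embeddings.shape)  # (number of documents, 384)
```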
Dimensionality reduction

Now that we've vectorised our data, the next step is to reduce its dimensionality. Dimensionality reduction is where you take a really long vector like [4, 2, 1, 5, ...] and turn it into lower-dimension coordinates like [4, 2]. This is how we map our data. It also makes our data easier to cluster later on, since high-dimensional data can often be quite tricky to cluster due to the 'curse of dimensionality'.

To reduce the data's dimensionality, we use a dimensionality reduction model, a model capable of projecting high-dimensional data into low-dimensional spaces while preserving as much meaningful information as possible. The model I used was PaCMAP, which benchmarks as one of the fastest and most accurate dimensionality reduction models, capable of preserving both global and local structures in underlying datasets. The visualisation below, courtesy of PaCMAP's GitHub repository, shows what it looks like when you attempt to reduce a three-dimensional model of a mammoth down to two dimensions with PaCMAP (visible on the far right) and other popular dimensionality reduction models.

[Image: A visualisation of the reduction of a three-dimensional model of a mammoth down to two dimensions with the most popular dimensionality reduction models, courtesy of the Apache-2.0 licensed PaCMAP GitHub repository.]

Now, because we're using PaCMAP for two different purposes, namely, to map the data and to make it easier to cluster, we can reduce our vectors to two different dimensions. For mapping, two dimensions is what I went with, but it's also possible to visualise three. For clustering, I chose 80 because my clusters seemed to benefit from high-dimensional data and 80 was the most dimensions I could use without slowing my PC down too much.

What worked for me, however, may not work for you. With another dataset of ~400 data points, far fewer than the Open Australian Legal Corpus' 200k documents, I found that 2 dimensions worked considerably better than 80. It is worth testing a range of dimensions to see what yields the best clusters for your data.

After installing the pacmap Python package, you can use the following code to reduce the dimensionality of your data for both mapping and later clustering it.
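The sketch below is an illustrative take on that step, using random vectors as a stand-in for the mean embeddings produced earlier and PaCMAP's default settings, which you may well want to tune:

```python
import numpy as np
import pacmap

# A random stand-in for the mean document embeddings produced during vectorisation.
embeddings = np.random.rand(1000, 384).astype(np.float32)

# Reduce the embeddings to two dimensions for plotting the map.
map_reducer = pacmap.PaCMAP(n_components=2)
map_coordinates = map_reducer.fit_transform(embeddings)

# Reduce the embeddings to a higher number of dimensions for clustering
# (80 is what worked for my data, but it is worth experimenting).
cluster_reducer = pacmap.PaCMAP(n_components=80)
cluster_vectors = cluster_reducer.fit_transform(embeddings)

print(map_coordinates.shape)  # (1000, 2)
print(cluster_vectors.shape)  # (1000, 80)
```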
Clustering

Once the dimensionality of your data has been reduced, we can use a clustering model to group it into clusters of data points that are close together in our vector space. These clusters tend to correlate with the broad set of topics and themes present in a dataset.

There are a range of clustering models to choose from, each with their own unique advantages and drawbacks. I ended up settling on HDBSCAN, which is generally well-regarded and whose documentation has a page covering its differences from other popular clustering methods. The most notable differentiator is that, unlike older algorithms such as k-means, HDBSCAN does not force every data point into a cluster, which is quite reasonable. There will always be data points that don't quite fit into a known box. Forcing them into boxes just makes those boxes noisier.

The only problem with HDBSCAN is that it can sometimes be overzealous in refusing to cluster data points. In my case, there were 218,336 legal texts in the Open Australian Legal Corpus at the time that I produced the map and 84,780 (38.8%) could not be clustered. A further 10,100 (4.6%) were placed in clusters that did not appear to have any meaningful unifying features. In total, there were 94,880 (43.4%) documents that could not be assigned to a meaningful cluster and so were excluded from the map. This is what the map would have looked like if I had included them.

[Image: A version of the map with documents without meaningful clusters included.]

It is clear that there were documents that HDBSCAN could and should have clustered, such as those forming part of the criminal and family law islands. This is likely the result of the curse of dimensionality, but it may also be because I used fast_hdbscan, a faster implementation of HDBSCAN that I later discovered has a tendency to produce patchier clusters than regular HDBSCAN. Accordingly, I'll be using regular HDBSCAN in my code example.

You will notice that there are two hyperparameters that may be tuned, min_cluster_size and min_samples. Thanks to this Reddit comment and my own experimentation, I have found that min_samples should approach log(n) as your dataset gets noisier, where n is the number of data points in the dataset. For clean data, it is appropriate to set min_samples to 1, which is what I used.

min_cluster_size refers to the minimum number of data points that should be in a cluster. It is possible to yield meaningful clusters with both low and high minimum cluster sizes. It may be thought of as a control on how generalised clusters should be. For granular clusters reflecting specific topics in your data, a low minimum cluster size is preferable. For a few broad clusters reflecting general themes in your data, a higher minimum cluster size is advised.

I ended up going with a minimum cluster size of 50, which, in proportion to the size of my database, was minuscule. I only did this because I wanted to manually merge clusters myself in order to ensure that the final clusters were both as broad as I wanted them to be and as precise as they could be. This resulted in 507 unique clusters (excluding the unassigned cluster), which I manually whittled down to 19 branches of law. I'll get into how I pulled that off in the next section, but for now, here is how HDBSCAN can be used to cluster vectors.
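Here's a minimal sketch of that step using the hdbscan package, with random vectors standing in for the PaCMAP-reduced embeddings and the hyperparameter values simply mirroring those discussed above:

```python
import hdbscan
import numpy as np

# A random stand-in for the 80-dimensional vectors produced by PaCMAP.
cluster_vectors = np.random.rand(1000, 80)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,  # The minimum number of documents a cluster may contain.
    min_samples=1,  # 1 suits clean data; approach log(n) the noisier your data is.
)
labels = clusterer.fit_predict(cluster_vectors)

# Documents HDBSCAN refuses to cluster are labelled -1.
print(f'{(labels == -1).sum()} documents could not be clustered.')
print(f'{labels.max() + 1} clusters were found.')
```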
Labelling

After clustering your data, you'll want to assign some meaningful labels to those clusters. This means figuring out exactly what it is that they share in common. There are a number of techniques to choose from for identifying meaningful labels for clusters, including the use of tf-idf, generative AI-based labellers and, of course, hand labelling. I won't get into all the options available; I'll just cover what worked for me.

First, I identified the top tokens in each cluster by their tf-idf, which is a measure of a token's frequency in a cluster weighted by its overall frequency in a dataset, such that tokens that are extremely common in only one cluster will have a higher tf-idf for that cluster than tokens that are extremely common in all clusters. This served as an easy way to quickly associate clusters with labels reflecting their unique composition.

With the top tokens by tf-idf in hand, I merged any clusters whose top four tokens were the same, which got rid of just 2 of the 507 clusters. Next, I manually reviewed the clusters in order to produce a set of 337 rules on how to merge them based on their top tokens.

In manually merging clusters, I tried my best to be as agnostic as possible on what the final set of categories should look like. The idea was to let the data guide me, rather than me guiding the data. As the number of clusters began to dwindle, however, I soon found myself forced to make increasingly difficult decisions about what categories to include and exclude, such as whether it would be better to roll up health and social services law into a single area of law rather than creating a commercial law category out of tax, finance and insolvency law. I originally wanted to include many more areas of law than the 19 branches you see on the map now, but I was ultimately constrained by the fact that even visualising 19 easily distinguishable categories on a map with contiguous continents is no small feat (it was only thanks to a colour palette published by fellow data scientist Sasha Trubetskoy that I managed to pull it off!).

Eventually, I ended up settling on a set of clusters that I felt were a reasonable way of dividing up my map, although I recognise it may not necessarily be the most optimal way, if such an optimum even exists. This merging process was one of the most difficult components of building the map, second only to writing this article, but it also taught me a lot not only about the broad makeup of Australian law but also about the many different ways law in general can be sliced up.

I would only recommend manually merging clusters if you also want to develop an intimate understanding of the composition of your data or if it is particularly important to you that the final product be as precise and accurate as possible. Otherwise, it would be much more beneficial to tune your clustering model to produce a more manageable number of clusters. You can then automatically label those clusters by either taking their top three tokens by tf-idf as their label or using a large language model to generate more coherent labels from those tokens.

In the first of the code sketches at the end of this article, I show how you can identify the top tokens in a cluster by tf-idf. Please note that the code relies on nltk. If you wanted to take your automated labelling a step further, you could also use the second sketch to get GPT-4 (or another OpenAI API-compatible model) to generate labels for you, keeping in mind that it requires the openai and semchunk Python libraries.

Visualisation

At this point, the only piece of the puzzle left is to visualise your map. There are a number of Python libraries capable of producing two- and three-dimensional scatter plots, but none of them are particularly impressive, including the library I eventually settled on, which was Plotly.

My main grievance with Plotly is that it does not let you expand the size of data points when zooming into a map. This really becomes an issue where you have hundreds of thousands of data points and you find that they either overlap with each other or, if you reduce their size, become almost impossible to identify when zoomed in. There is a 3-year-old GitHub issue concerning this problem, but it doesn't look like it will get solved anytime soon. There were other less severe issues I experienced with Plotly that I was able to work around with custom CSS and Javascript. I won't provide that code at the moment as it is not particularly pretty, but the final code sketch at the end of this article illustrates how Plotly can be used to visualise mapped data.

With that done, you should now have your very own semantic map. The next step is to analyse it. Look for patterns in the map's geography, inspect outliers, search for islands, get a sense of the underlying structure of your data. As my own analysis has shown, there is a lot you can learn just by mapping a dataset. And, once you get the ball rolling, it can quickly spiral into an addiction. I have hundreds of ideas for how to expand my map to uncover new relationships.

The real power of semantic mapping comes out when you apply it against very large datasets. Imagine applying these techniques on the Common Crawl corpus, for example. You would be able to produce a first-of-its-kind high-resolution map of the internet.

If you do end up publishing your own semantic map, be sure to cite this article to enable others to learn about the power of semantic mapping. Otherwise, happy mapping!
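The code sketches referenced above follow. They are minimal, illustrative takes on the labelling and visualisation steps rather than the exact code behind the map: the toy data, tokenisation, prompt and styling are all assumptions you should adapt to your own dataset.

First, identifying a cluster's top tokens by tf-idf (using nltk for its stopword list):

```python
import re
from collections import Counter

import nltk
import numpy as np

nltk.download('stopwords')
from nltk.corpus import stopwords

# Toy stand-ins for the documents and the cluster labels produced by HDBSCAN.
documents = ['The applicant seeks review of a decision to refuse a protection visa.',
             'This order concerns the importation of forklifts into Australia.']
labels = [0, 1]

stop_words = set(stopwords.words('english'))

def tokenise(text: str) -> list[str]:
    """Lowercase and tokenise a text, dropping stopwords."""
    return [token for token in re.findall(r'[a-z]+', text.lower()) if token not in stop_words]

# Treat each cluster as a single 'document' by pooling the tokens of its texts.
clusters = sorted(set(labels) - {-1})
cluster_counts = {cluster: Counter() for cluster in clusters}

for document, label in zip(documents, labels):
    if label != -1:
        cluster_counts[label].update(tokenise(document))

def top_tokens(cluster: int, n: int = 4) -> list[str]:
    """Score a cluster's tokens by their frequency within the cluster, weighted down
    by the number of clusters they appear in, and return the n highest scoring."""

    counts = cluster_counts[cluster]
    total = sum(counts.values())
    scores = {}

    for token, count in counts.items():
        tf = count / total
        df = sum(1 for other in clusters if token in cluster_counts[other])
        scores[token] = tf * np.log(len(clusters) / df)

    return sorted(scores, key=scores.get, reverse=True)[:n]

for cluster in clusters:
    print(cluster, top_tokens(cluster))
```

Next, getting GPT-4 (or another OpenAI API-compatible model) to generate a more coherent label from a cluster's top tokens and a sample of its text:

```python
import semchunk
from openai import OpenAI

client = OpenAI()  # Assumes OPENAI_API_KEY is set in your environment.

def label_cluster(top_tokens: list[str], sample_document: str, model: str = 'gpt-4') -> str:
    """Ask an LLM to label a cluster from its top tokens and a sample of its text."""

    # Take only the first chunk of the sample document so the prompt stays short
    # (a rough word-based counter is used here purely for illustration).
    excerpt = semchunk.chunk(sample_document, chunk_size=1000,
                             token_counter=lambda text: len(text.split()))[0]

    response = client.chat.completions.create(
        model=model,
        messages=[{
            'role': 'user',
            'content': 'Suggest a short label (a few words at most) for a cluster of documents '
                       f'whose most distinctive tokens are {", ".join(top_tokens)} and which '
                       f'contains documents like the following:\n\n{excerpt}',
        }],
    )

    return response.choices[0].message.content.strip()

print(label_cluster(['migration', 'visa', 'tribunal', 'applicant'],
                    'The applicant seeks review of a decision to refuse a protection visa...'))
```

And, finally, plotting the mapped documents with Plotly:

```python
import pandas as pd
import plotly.express as px

# A toy stand-in for the mapped documents: two-dimensional coordinates plus metadata.
data = pd.DataFrame({
    'x': [0.1, 0.9, 0.5],
    'y': [0.2, 0.8, 0.4],
    'title': ['A migration case', 'A tariff concession order', 'A family law judgment'],
    'type': ['Case', 'Legislation', 'Case'],
    'category': ['Migration', 'Commercial', 'Family'],
})

fig = px.scatter(
    data,
    x='x',
    y='y',
    color='category',  # Colour data points by their category.
    hover_name='title',  # Show a document's title when hovering over it.
    hover_data=['type', 'category'],
)
fig.update_traces(marker={'size': 4})  # Keep markers small so dense regions stay legible.
fig.update_layout(xaxis_visible=False, yaxis_visible=False)  # The axes carry no meaning.
fig.write_html('map.html')  # Save the interactive map to a standalone HTML file.
```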