Mapping (almost) every law, regulation and case in Australia
Mar 21, 2024 · Data, Law

What if you could take every law, regulation and case in Australia and project them onto a two-dimensional map such that their distance from one another was proportional to their similarity in meaning? What would that look like? Perhaps something like this.

[Image: The first ever map of Australian law.]

This is the first ever map of Australian law. Each point represents a unique law, regulation or case in the Open Australian Legal Corpus, the world's largest open source database of Australian law (you can learn about how I built that corpus here). The closer any two documents are on the map, the more similar they are in meaning.

If you're on a computer and hover over a document, you'll see its title, type, jurisdiction and category. You can open a document by clicking on it. Documents are coloured by category. The legend on the right shows what colour each category corresponds to. Click on a category and you'll exclude it from the map. Double click and you'll only see documents from that category.

Over the course of this article, I'll cover exactly what the map can teach us about Australian law as well as give you a behind-the-scenes look at how I built it, providing code examples along the way. Those more interested in the technology powering the map can skip straight to that section here.

What can we learn from it?

While it might not look like much at first, the map gives us a rare look into some of the many hidden ways Australian laws, regulations and cases are both connected to and disconnected from one another.

The invisible barrier between cases and legislation

It is readily observable, for example, that there is a sort of invisible barrier separating cases on the one hand from legislation on the other. This barrier corresponds roughly with the map's north and south poles.

[Image: An annotated version of the map where cases and legislation are enclosed in two shapes corresponding with the map's north and south poles, respectively.]

The presence of this barrier tells us that documents of the same type will tend to share more in common with each other than they will with documents of the same subject matter. Although they may often focus on the same topics, cases and legislation are, after all, written in different styles, towards different ends.

The absence of borders between documents of different jurisdictions

Interestingly, however, we find no such borders between documents of different jurisdictions, although it is worth noting that, due to copyright restrictions, the Open Australian Legal Corpus only contains decisions from the Commonwealth and New South Wales and is missing legislation from Victoria, the Northern Territory and the Australian Capital Territory.

[Image: An alternate view of the map where documents are coloured by jurisdiction, illustrating the lack of boundaries between documents of different jurisdictions.]

The absence of borders between cases and legislation of different jurisdictions indicates that Australian state and federal law is relatively homogenous.
No differences between the style, principles of interpretation or general jurisprudence of state and federal law appear to be reflected in the map. What borders do exist between state and federal law correspond better with differences in subject matter than with the jurisprudence of their jurisdictions. This conforms with the fact that state and federal courts and legislatures operate within a single legal framework, under which they have jurisdiction over matters prescribed by the Constitution in their territory, with a single court, the High Court of Australia, arbitrating on disputes between governments over the precise limits of those constitutional rights and powers.

The judicial and legislative mainlands and islands

Turning back to the barrier between cases and legislation, we also observe that, within the map's north and south poles, each pole has a 'mainland' of sorts that most documents belong to, and then there is a range of 'islands' that orbit those mainlands, typically consisting of documents of the same subject matter.

[Image: An annotated version of the map where 'islands' of documents of the same subject matter are enclosed in shapes that orbit 'mainlands' of cases and legislation.]

The fact that there are judicial and legislative mainlands suggests that most cases and legislation draw from and feed into a single, interconnected pool of knowledge. This is not particularly surprising. What is surprising is that there are large islands of legislation and judgments that are entirely cut off from their respective mainlands.

Tariff concession orders, for example, form their very own unique archipelago, perhaps because each order is centred around regulating a distinct, often quite technical class of importable goods, from magazine holders to forklifts. There is also quite a sizeable island of airworthiness directives primarily focused on regulating aircraft components, another highly technical domain.

Somewhat unexpectedly, the largest island by surface area consists almost entirely of migration cases. Furthermore, of all 19 possible branches of law, migration and family law are the only two to be found more often outside a mainland than inside one. Migration and family law are, in effect, the most isolated areas of Australian law on the map.

Funnily enough, while researching why that might be, I stumbled upon this rather pertinent quote from Lord Sumption:

Courts exercising family jurisdiction do not occupy a desert island in which general legal concepts are suspended or mean something different. If a right of property exists, it exists in every division of the High Court and in every jurisdiction of the county courts. If it does not exist, it does not exist anywhere.

Prest v Petrodel Resources Ltd [2013] UKSC 34, [37] (emphasis added)

I also discovered that Munby LJ, later President of the UK Family Division, had likewise once quipped:

The Family Division is part of the High Court. It is not some legal Alsatia where the common law and equity do not apply. The rules of agency apply there as much as elsewhere. But in applying those rules one must have regard to the context ...

Richardson v Richardson [2011] EWCA Civ 79, [53]

It would seem that there was already a perception that family law is somewhat isolated from the rest of the law, which the map appears to support.
As for migration law, although I was unable to locate any equally apropos quotes, from my own review of a selection of cases on the map, migration cases appear relatively self-contained in that they tend to reference legislation and cases particular to migration law. It also makes sense that migration law would be a little distant from other areas of law given its unique subject matter.

While not as insular as family and migration law, the relatively large hexagram-shaped island of criminal law, which features a tail of transport and administrative law cases coming out of it, is also worth addressing. That island appears to consist mostly of substantive criminal law cases (along with certain punitive transport and administrative law cases focused on the suspension of various types of licences), whereas the criminal law cases connected to the judicial mainland tend to concern criminal procedure.

[Image: An annotated version of criminal law cases with the island of substantive criminal law cases and the cluster of procedural criminal law cases connected to the judicial mainland enclosed in their own shapes. Only light blue data points are criminal cases.]

This supports the broad division of substantive law into criminal law and civil law while also conforming well with the fact that criminal procedure law and civil procedure law share a number of common principles of natural justice.

The most and least legislative areas of judicial law

Fascinatingly, migration, family and substantive criminal law also all tend to cluster closely together latitudinally, hinting at a potential hidden connection. They are all known to overlap in certain ways and they all share a special focus on regulating the lives of individuals, and not merely the property rights of legal persons.

Migration, family and substantive criminal law cases also all happen to be the most distant types of cases from legislation on the map. Of course, this does not mean that they never cite legislation, but it may be that they rely on precedent more often than other areas of case law. It might also be the result of the inherent difficulty in attempting to represent highly complex and multidimensional relationships in a simple two-dimensional map.

Conversely, the class of cases closest to legislation is development cases, which makes sense since they can often deal quite intimately with local planning laws and regulations.

The case law continuum

If we start at the bottom of the cases mainland and make our way up, we can also see that Australian case law is a continuum of sorts.

[Image: An annotated version of the case law mainland where select branches of law are pointed out, illustrating the continuum of case law.]

Development cases connect with environmental cases, which then link with land cases. Land cases border contract cases, which in turn have procedural cases to their north, intellectual property cases to their west and commercial cases to their east. Moving further north of procedural law brings you to criminal law and defamation. Heading west from intellectual property law takes you through administrative law, health and social services law, employment law, negligence and finally transport law. Going east of commercial law, you'll find equity and a subset of family law.

[Image: An animation of branches of law appearing on the map sequentially, illustrating the continuum of case law.]
This continuum corresponds well with our pre-existing understandings of the relationships between the various branches of the law. It makes sense, for example, that development, environment and land law would all be intertwined given their similar subject matter. Likewise, it is not at all unexpected that negligence would cluster closely with transport and employment law given that a great many negligence cases centre around motor and workplace accident claims. The map, in effect, crystallises our own mental models of the law.

It also shows us that the borders between various areas of the law can often be quite porous. We notice, for instance, that there is a streak of land law judgments that overlaps with commercial and procedural law cases and is disconnected from most other land law cases. Interestingly, cases in this streak tend to focus on mortgage disputes, often involving defaults, which would explain why they overlap with commercial and procedural law cases.

We can also see that there are some transport law judgments that are connected to the cases mainland and others that are connected to the island of substantive criminal law cases. Transport judgments connected to that island often centre around the suspension of transport licences, whereas judgments connected to the cases mainland tend to focus on transport accidents. Although disconnected from one another, both clusters of transport cases are still relatively close to each other, reflecting their shared subject matter.

Final thoughts

By now, we've covered how the map reflects already known distinctions between cases and legislation, while also revealing potential new divisions and hidden connections between various areas of the law. We've also seen how Australian case law can be more of a continuum than a rigidly defined structure and how the borders between branches of case law can often be quite porous. Other specific insights we've been able to find are that:

* Migration, family and substantive criminal law are the most isolated branches of case law on the map;
* Migration, family and substantive criminal law are the most distant branches of case law from legislation on the map;
* Development law is the closest branch of case law to legislation on the map; and
* The map does not reveal any noticeable distinctions between Australian state and federal law, whether it be in style, principles of interpretation or general jurisprudence.

These are but a selection of the most readily observable insights to be gained from attempting to map Australian law. There are no doubt countless others waiting to be uncovered. Producing a three-dimensional map of Australian laws, cases and regulations could, for example, reveal new hidden relationships that are almost impossible to represent in two dimensions. Adding cases and legislation from other states and territories might also give us a sharper, higher resolution image of the map, deepening our understanding of the geography of Australian law. One could even imagine adding legal documents from other common law countries such as the UK, Canada and New Zealand to, in a sense, photograph the historic and continued interactions between our legal systems.

Nevertheless, for a first attempt, the map already has a lot to teach us. Perhaps you've even identified patterns in the map that I could not. The greatest thing about this exercise is that it can be applied to virtually any domain, not just Australian law.
Semantic mapping is particularly useful for very quickly developing an understanding of the underlying composition and structure of a dataset without having to manually scour through hundreds of examples to build your own much noisier and less persistent mental model of that data. Since finishing the map, I've already been able to reuse this technique to study countless other seemingly unstructured large datasets, and you can too. It doesn't take an expert in clustering and mapping to pull it off. Far from it. Prior to starting this project, I didn't know the first thing about semantic mapping and now I'm about to teach you how to do it yourself.

So how'd you do it?

At a high level, the process for mapping any arbitrary set of data points, whether they be PDFs, YouTube videos, TikToks or anything else, can be broken down into six stages, illustrated below.

[Image: An illustration of the process of semantically mapping data.]

In brief, we try to represent the meaning of data in the form of sets of numbers (vectorisation), after which we group those sets into clusters based on their similarity (clustering) and subsequently label those clusters based on whatever unique patterns we can find in them (labelling). Finally, we project the numerical representations of the data into two-dimensional coordinates (dimensionality reduction), which we then plot on a map (visualisation).

Throughout this next section, we'll take a deeper look at exactly how every step of the semantic mapping process works in practice. Before that though, I'd like to express my gratitude to the creators of BERTopic, a topic modelling technique which this process was loosely based on, as well as Dr Mike DeLong, whose topic map of the Open Australian Legal Corpus served as the inspiration for this entire project.

Vectorisation

The first step in semantically mapping a dataset is to vectorise its data points. In this context, vectorisation refers to the process of converting information into a set of numbers intended to represent its underlying meaning, known as a vector or embedding. By calculating how similar vectors are to one another, we can also get a rough idea of how similar they are in meaning. This principle is what allows us to later group data points into clusters and project them onto a two-dimensional map.

To vectorise a data point, we can use an embedding model, a model specially trained for the task of representing the meaning of information as vectors. Thankfully, for my uses and probably yours too, it isn't necessary to train a custom embedding model or pay someone to use theirs. At least for text vectorisation, most of the world's best models are already available for free and under open-source licences. Hugging Face helpfully maintains a ranked list of the most accurate text embedding models as benchmarked against hundreds of datasets, known as the Massive Text Embedding Benchmark (MTEB) Leaderboard. When I built the map, BAAI/bge-small-en-v1.5 was one of the best open-source models available for its size, so that's what I went with. Nowadays, avsolatorio/GIST-small-Embedding-v0 (a finetune of that model) ranks higher, but it's worth checking out the leaderboard yourself as new models are released every day.

One constraint of contemporary text embedding models worth keeping in mind is that they can only vectorise a fixed number of tokens, known as a context window. If you don't know what a token is, you can think of it as the most basic unit of input a text model can take. A token is roughly three-quarters of a word, so if a text embedding model's context window is 512 tokens, like GIST-small-Embedding-v0's is, then you can only vectorise roughly 384 words at a time.
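To get a feel for tokens and context windows, you can ask a model's tokeniser how many tokens a text takes up and how many tokens the model will accept at most. This is a quick illustration I've added rather than part of the original pipeline, and it assumes you have transformers installed:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('BAAI/bge-small-en-v1.5')

text = 'The quick brown fox jumps over the lazy dog.'
tokens = tokenizer.encode(text, add_special_tokens=False)

print(len(tokens))  # The number of tokens the model would see for this text.
print(tokenizer.model_max_length)  # The model's context window (512 for this model).
```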
To get around this, we can split text into chunks up to 512 tokens long, vectorise each chunk and then average those vectors to produce a mean text embedding that represents the average meaning of the text. This process can also be applied to vectorise videos and audio clips longer than what an embedding model can take as input, or really any other type of embeddable data.

[Image: An illustration of the process of producing a mean embedding.]

In splitting up our long-form data, however, it is essential that we do so in as meaningful a way as possible. Simply breaking up text at every 512th token or, if we're working with audiovisual data, every 512th second, could result in the loss of semantically important information. Imagine if we ended up splitting the sentence 'I love to eat kangaroo gummies, they're my favourite snack' at the word 'kangaroo', resulting in the chunks 'I love to eat kangaroo' and 'gummies, they're my favourite snack'. The final embedding would no doubt be quite dissimilar from the text's actual meaning.

Ideally, we'd like for our data to have already been divided into semantically meaningful sections that are all under our model's context window. Realistically though, our data may not have any sections at all or, if it does have sections, not all may fit within the model's context window. In such cases, we can first split our data into whatever parts we do have and then use a semantic chunker to bring whatever data is over the context window, under it.

For text data, I'd recommend semchunk, an extremely fast and lightweight Python library I developed to split millions of tokens' worth of text into chunks as semantically meaningful as possible in a matter of seconds. It works by searching for sequences of characters known to indicate semantic breaks in text, such as consecutive newlines and tabs, and then recursively splitting at those sequences until a text is under the given chunk size.

The code below demonstrates how you can use semchunk to split a dataset of documents into chunks with any given Hugging Face text embedding model of your choice. Just make sure you have semchunk and transformers installed beforehand. After chunking our data, we still need to vectorise it and then average those vectors such that [1, 2] and [3, 4] become [2, 3] (not [7]). This is how you'd do that in practice, keeping in mind this code also requires torch and tqdm.
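What follows are minimal sketches of those two steps rather than the exact code behind the map: the model, the toy documents and the mean-pooling strategy are my own illustrative assumptions that you may want to adapt (bge-style models, for instance, are often used with CLS pooling instead). First, chunking:

```python
import semchunk
from tqdm import tqdm
from transformers import AutoTokenizer

MODEL = 'BAAI/bge-small-en-v1.5'  # Any Hugging Face text embedding model will do.
CHUNK_SIZE = 512  # The model's context window, in tokens.

tokenizer = AutoTokenizer.from_pretrained(MODEL)

# A toy dataset standing in for your own documents.
documents = [
    'An Act relating to the importation of goods into Australia...',
    'The applicant seeks review of a decision of the Tribunal...',
]

# Count tokens without special tokens so that chunks fit within the context window.
token_counter = lambda text: len(tokenizer.encode(text, add_special_tokens=False))

# Split every document into semantically meaningful chunks of at most 512 tokens.
chunked_documents = [
    semchunk.chunk(document, chunk_size=CHUNK_SIZE, token_counter=token_counter)
    for document in tqdm(documents, desc='Chunking')
]
```

And then vectorising the chunks and averaging them into a single mean embedding per document:

```python
import torch
from transformers import AutoModel

model = AutoModel.from_pretrained(MODEL).eval()

def embed(chunks: list[str]) -> torch.Tensor:
    """Vectorise a document's chunks and average them into a mean embedding."""

    inputs = tokenizer(chunks, padding=True, truncation=True,
                       max_length=CHUNK_SIZE, return_tensors='pt')

    with torch.no_grad():
        outputs = model(**inputs)

    # Mean-pool each chunk's token embeddings, ignoring padding tokens.
    mask = inputs['attention_mask'].unsqueeze(-1)
    chunk_embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)

    # Average the chunk embeddings into a single vector representing the document.
    return chunk_embeddings.mean(dim=0)

embeddings = torch.stack([embed(chunks) for chunks in tqdm(chunked_documents, desc='Vectorising')])
print(embeddings.shape)  # (number of documents, 384)
```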
Dimensionality reduction

Now that we've vectorised our data, the next step is to reduce its dimensionality. Dimensionality reduction is where you take a really long vector like [4, 2, 1, 5, ...] and turn it into lower-dimension coordinates like [4, 2]. This is how we map our data. It also makes our data easier to cluster later on, since high-dimensional data can often be quite tricky to cluster due to the 'curse of dimensionality'.

To reduce the data's dimensionality, we use a dimensionality reduction model, a model capable of projecting high-dimensional data into low-dimensional spaces while preserving as much meaningful information as possible. The model I used was PaCMAP, which benchmarks as one of the fastest and most accurate dimensionality reduction models, capable of preserving both global and local structures in underlying datasets. The visualisation below, courtesy of PaCMAP's GitHub repository, shows what it looks like when you attempt to reduce a three-dimensional model of a mammoth down to two dimensions with PaCMAP (visible on the far right) and other popular dimensionality reduction models.

[Image: A visualisation of the reduction of a three-dimensional model of a mammoth down to two dimensions with the most popular dimensionality reduction models, courtesy of the Apache-2.0 licensed PaCMAP GitHub repository.]

Now, because we're using PaCMAP for two different purposes, namely, to map the data and to make it easier to cluster, we can reduce our vectors to two different dimensions. For mapping, two dimensions is what I went with, but it's also possible to visualise three. For clustering, I chose 80 because my clusters seemed to benefit from high-dimensional data and 80 was the most dimensions I could use without slowing my PC down too much.

What worked for me, however, may not work for you. With another dataset of ~400 data points, far fewer than the Open Australian Legal Corpus' 200k documents, I found that 2 dimensions worked considerably better than 80. It is worth testing a range of dimensions to see what yields the best clusters for your data.

After installing the pacmap Python package, you can use the following code to reduce the dimensionality of your data for both mapping and later clustering it.
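The sketch below is an illustrative take on that step, using random vectors as a stand-in for the mean embeddings produced earlier and PaCMAP's default settings, which you may well want to tune:

```python
import numpy as np
import pacmap

# A random stand-in for the mean document embeddings produced during vectorisation.
embeddings = np.random.rand(1000, 384).astype(np.float32)

# Reduce the embeddings to two dimensions for plotting the map.
map_reducer = pacmap.PaCMAP(n_components=2)
map_coordinates = map_reducer.fit_transform(embeddings)

# Reduce the embeddings to a higher number of dimensions for clustering
# (80 is what worked for my data, but it is worth experimenting).
cluster_reducer = pacmap.PaCMAP(n_components=80)
cluster_vectors = cluster_reducer.fit_transform(embeddings)

print(map_coordinates.shape)  # (1000, 2)
print(cluster_vectors.shape)  # (1000, 80)
```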
Clustering

Once the dimensionality of your data has been reduced, we can use a clustering model to group it into clusters of data points that are close together in our vector space. These clusters tend to correlate with the broad set of topics and themes present in a dataset.

There are a range of clustering models to choose from, each with their own unique advantages and drawbacks. I ended up settling on HDBSCAN, which is generally well-regarded and whose documentation has a page covering its differences from other popular clustering methods. The most notable differentiator is that, unlike older algorithms such as k-means, HDBSCAN does not force every data point into a cluster, which is quite reasonable. There will always be data points that don't quite fit into a known box. Forcing them into boxes just makes those boxes noisier.

The only problem with HDBSCAN is that it can sometimes be overzealous in refusing to cluster data points. In my case, there were 218,336 legal texts in the Open Australian Legal Corpus at the time that I produced the map and 84,780 (38.8%) could not be clustered. A further 10,100 (4.6%) were placed in clusters that did not appear to have any meaningful unifying features. In total, there were 94,880 (43.4%) documents that could not be assigned to a meaningful cluster and so were excluded from the map. This is what the map would have looked like if I had included them.

[Image: A version of the map with documents without meaningful clusters included.]

It is clear that there were documents that HDBSCAN could and should have clustered, such as those forming part of the criminal and family law islands. This is likely the result of the curse of dimensionality, but it may also be because I used fast_hdbscan, a faster implementation of HDBSCAN that I later discovered has a tendency to produce patchier clusters than regular HDBSCAN. Accordingly, I'll be using regular HDBSCAN in my code example.

You will notice that there are two hyperparameters that may be tuned, min_cluster_size and min_samples. Thanks to this Reddit comment and my own experimentation, I have found that min_samples should approach log(n) as your dataset gets noisier, where n is the number of data points in the dataset. For clean data, it is appropriate to set min_samples to 1, which is what I used.

min_cluster_size refers to the minimum number of data points that should be in a cluster. It is possible to yield meaningful clusters with both low and high minimum cluster sizes. It may be thought of as a control on how generalised clusters should be. For granular clusters reflecting specific topics in your data, a low minimum cluster size is preferable. For a few broad clusters reflecting general themes in your data, a higher minimum cluster size is advised.

I ended up going with a minimum cluster size of 50, which, in proportion to the size of my database, was minuscule. I only did this because I wanted to manually merge clusters myself in order to ensure that the final clusters were both as broad as I wanted them to be and as precise as they could be. This resulted in 507 unique clusters (excluding the unassigned cluster), which I manually whittled down to 19 branches of law. I'll get into how I pulled that off in the next section, but for now, here is how HDBSCAN can be used to cluster vectors.
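Here's a minimal sketch of that step using the hdbscan package, with random vectors standing in for the PaCMAP-reduced embeddings and the hyperparameter values simply mirroring those discussed above:

```python
import hdbscan
import numpy as np

# A random stand-in for the 80-dimensional vectors produced by PaCMAP.
cluster_vectors = np.random.rand(1000, 80)

clusterer = hdbscan.HDBSCAN(
    min_cluster_size=50,  # The minimum number of documents a cluster may contain.
    min_samples=1,  # 1 suits clean data; approach log(n) the noisier your data is.
)
labels = clusterer.fit_predict(cluster_vectors)

# Documents HDBSCAN refuses to cluster are labelled -1.
print(f'{(labels == -1).sum()} documents could not be clustered.')
print(f'{labels.max() + 1} clusters were found.')
```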
Labelling

After clustering your data, you'll want to assign some meaningful labels to those clusters. This means figuring out exactly what it is that they share in common. There are a number of techniques to choose from for identifying meaningful labels for clusters, including the use of tf-idf, generative AI-based labellers and, of course, hand labelling. I won't get into all the options available; I'll just cover what worked for me.

First, I identified the top tokens in each cluster by their tf-idf, which is a measure of a token's frequency in a cluster weighted by its overall frequency in a dataset, such that tokens that are extremely common in only one cluster will have a higher tf-idf for that cluster than tokens that are extremely common in all clusters. This served as an easy way to quickly associate clusters with labels reflecting their unique composition.

With the top tokens by tf-idf in hand, I merged any clusters whose top four tokens were the same, which got rid of just 2 of the 507 clusters. Next, I manually reviewed the clusters in order to produce a set of 337 rules on how to merge them based on their top tokens.

In manually merging clusters, I tried my best to be as agnostic as possible on what the final set of categories should look like. The idea was to let the data guide me, rather than me guiding the data. As the number of clusters began to dwindle, however, I soon found myself forced to make increasingly difficult decisions about what categories to include and exclude, such as whether it would be better to roll up health and social services law into a single area of law rather than creating a commercial law category out of tax, finance and insolvency law. I originally wanted to include many more areas of law than the 19 branches you see on the map now, but I was ultimately constrained by the fact that even visualising 19 easily distinguishable categories on a map with contiguous continents is no small feat (it was only thanks to a colour palette published by fellow data scientist Sasha Trubetskoy that I managed to pull it off!).

Eventually, I ended up settling on a set of clusters that I felt were a reasonable way of dividing up my map, although I recognise it may not necessarily be the most optimal way, if such an optimum even exists. This merging process was one of the most difficult components of building the map, second only to writing this article, but it also taught me a lot not only about the broad makeup of Australian law but also about the many different ways law in general can be sliced up.

I would only recommend manually merging clusters if you also want to develop an intimate understanding of the composition of your data or if it is particularly important to you that the final product be as precise and accurate as possible. Otherwise, it would be much more beneficial to tune your clustering model to produce a more manageable number of clusters. You can then automatically label those clusters by either taking their top three tokens by tf-idf as their label or using a large language model to generate more coherent labels from those tokens.

In the first of the code sketches at the end of this article, I show how you can identify the top tokens in a cluster by tf-idf. Please note that the code relies on nltk. If you wanted to take your automated labelling a step further, you could also use the second sketch to get GPT-4 (or another OpenAI API-compatible model) to generate labels for you, keeping in mind that it requires the openai and semchunk Python libraries.

Visualisation

At this point, the only piece of the puzzle left is to visualise your map. There are a number of Python libraries capable of producing two- and three-dimensional scatter plots, but none of them are particularly impressive, including the library I eventually settled on, which was Plotly.

My main grievance with Plotly is that it does not let you expand the size of data points when zooming into a map. This really becomes an issue where you have hundreds of thousands of data points and you find that they either overlap with each other or, if you reduce their size, become almost impossible to identify when zoomed in. There is a 3-year-old GitHub issue concerning this problem, but it doesn't look like it will get solved anytime soon. There were other less severe issues I experienced with Plotly that I was able to work around with custom CSS and Javascript. I won't provide that code at the moment as it is not particularly pretty, but the final code sketch at the end of this article illustrates how Plotly can be used to visualise mapped data.

With that done, you should now have your very own semantic map. The next step is to analyse it. Look for patterns in the map's geography, inspect outliers, search for islands, get a sense of the underlying structure of your data. As my own analysis has shown, there is a lot you can learn just by mapping a dataset. And, once you get the ball rolling, it can quickly spiral into an addiction. I have hundreds of ideas for how to expand my map to uncover new relationships.

The real power of semantic mapping comes out when you apply it against very large datasets. Imagine applying these techniques on the Common Crawl corpus, for example. You would be able to produce a first-of-its-kind high-resolution map of the internet.

If you do end up publishing your own semantic map, be sure to cite this article to enable others to learn about the power of semantic mapping. Otherwise, happy mapping!
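The code sketches referenced above follow. They are minimal, illustrative takes on the labelling and visualisation steps rather than the exact code behind the map: the toy data, tokenisation, prompt and styling are all assumptions you should adapt to your own dataset.

First, identifying a cluster's top tokens by tf-idf (using nltk for its stopword list):

```python
import re
from collections import Counter

import nltk
import numpy as np

nltk.download('stopwords')
from nltk.corpus import stopwords

# Toy stand-ins for the documents and the cluster labels produced by HDBSCAN.
documents = ['The applicant seeks review of a decision to refuse a protection visa.',
             'This order concerns the importation of forklifts into Australia.']
labels = [0, 1]

stop_words = set(stopwords.words('english'))

def tokenise(text: str) -> list[str]:
    """Lowercase and tokenise a text, dropping stopwords."""
    return [token for token in re.findall(r'[a-z]+', text.lower()) if token not in stop_words]

# Treat each cluster as a single 'document' by pooling the tokens of its texts.
clusters = sorted(set(labels) - {-1})
cluster_counts = {cluster: Counter() for cluster in clusters}

for document, label in zip(documents, labels):
    if label != -1:
        cluster_counts[label].update(tokenise(document))

def top_tokens(cluster: int, n: int = 4) -> list[str]:
    """Score a cluster's tokens by their frequency within the cluster, weighted down
    by the number of clusters they appear in, and return the n highest scoring."""

    counts = cluster_counts[cluster]
    total = sum(counts.values())
    scores = {}

    for token, count in counts.items():
        tf = count / total
        df = sum(1 for other in clusters if token in cluster_counts[other])
        scores[token] = tf * np.log(len(clusters) / df)

    return sorted(scores, key=scores.get, reverse=True)[:n]

for cluster in clusters:
    print(cluster, top_tokens(cluster))
```

Next, getting GPT-4 (or another OpenAI API-compatible model) to generate a more coherent label from a cluster's top tokens and a sample of its text:

```python
import semchunk
from openai import OpenAI

client = OpenAI()  # Assumes OPENAI_API_KEY is set in your environment.

def label_cluster(top_tokens: list[str], sample_document: str, model: str = 'gpt-4') -> str:
    """Ask an LLM to label a cluster from its top tokens and a sample of its text."""

    # Take only the first chunk of the sample document so the prompt stays short
    # (a rough word-based counter is used here purely for illustration).
    excerpt = semchunk.chunk(sample_document, chunk_size=1000,
                             token_counter=lambda text: len(text.split()))[0]

    response = client.chat.completions.create(
        model=model,
        messages=[{
            'role': 'user',
            'content': 'Suggest a short label (a few words at most) for a cluster of documents '
                       f'whose most distinctive tokens are {", ".join(top_tokens)} and which '
                       f'contains documents like the following:\n\n{excerpt}',
        }],
    )

    return response.choices[0].message.content.strip()

print(label_cluster(['migration', 'visa', 'tribunal', 'applicant'],
                    'The applicant seeks review of a decision to refuse a protection visa...'))
```

And, finally, plotting the mapped documents with Plotly:

```python
import pandas as pd
import plotly.express as px

# A toy stand-in for the mapped documents: two-dimensional coordinates plus metadata.
data = pd.DataFrame({
    'x': [0.1, 0.9, 0.5],
    'y': [0.2, 0.8, 0.4],
    'title': ['A migration case', 'A tariff concession order', 'A family law judgment'],
    'type': ['Case', 'Legislation', 'Case'],
    'category': ['Migration', 'Commercial', 'Family'],
})

fig = px.scatter(
    data,
    x='x',
    y='y',
    color='category',  # Colour data points by their category.
    hover_name='title',  # Show a document's title when hovering over it.
    hover_data=['type', 'category'],
)
fig.update_traces(marker={'size': 4})  # Keep markers small so dense regions stay legible.
fig.update_layout(xaxis_visible=False, yaxis_visible=False)  # The axes carry no meaning.
fig.write_html('map.html')  # Save the interactive map to a standalone HTML file.
```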