[HN Gopher] Show HN: how I built the largest open database of Au...
       ___________________________________________________________________
        
       Show HN: how I built the largest open database of Australian law
        
       Author : ubutler
       Score  : 99 points
       Date   : 2023-10-29 12:06 UTC (10 hours ago)
        
 (HTM) web link (umarbutler.com)
 (TXT) w3m dump (umarbutler.com)
        
       | ubutler wrote:
       | Hey HN, Over the past year, I've been working on building the
       | Open Australian Legal Corpus, the largest open database of
       | Australian law. I started this project when I realised there were
       | no open databases of Australian law I could use to train an LLM
       | on.
       | 
       | In this article, I run through the entire process of how I built
       | my database, from months-long negotiations with governments to
       | reverse engineering ancient web technologies to hacking together
       | a multitude of different solutions for extracting text from
       | documents.
       | 
       | My hope is that the next time someone like me is interested in
       | training an LLM to solve legal problems, they won't have to go
       | down a year-long journey of trying to find the right data!
       | 
       | You can find my database on HuggingFace
       | (https://huggingface.co/datasets/umarbutler/open-australian-l...)
       | and the code used to create it on GitHub
       | (https://github.com/umarbutler/open-australian-legal-
       | corpus-c...).
        
         | nextworddev wrote:
         | Awesome work!
        
         | benn0 wrote:
         | Fantastic work, and really appreciate the write up. It's quite
         | timely for me - I'm from a tech background and have just
         | started studying Australian law, and was thinking about doing
         | exactly this - so you are years ahead of me :).
         | 
         | Just one note - the link in your Github readme to
         | https://umarbutler.com/open-australian-legal-corpus doesn't
         | seem to go anywhere.
         | 
         | For someone interested in using the data (and helping out with
         | bugs/issues), where would you suggest starting?
        
       | DamonHD wrote:
       | Would it be worth getting your corpus replicated into other
       | venues as well, such as the Internet Archive or GitHub itself?
        
         | ubutler wrote:
         | GitHub might be difficult as they impose constraints on the
         | size of repositories and the Corpus is around 5GB. The Internet
         | Archive is a good idea, however; I'll have a look into that.
         | I've also been thinking about sticking it on Kaggle as well to
         | increase its reach.
        
           | DamonHD wrote:
           | There are also national and university data repositories that
           | might be interested and for which 5GB is not even noticeable!
        
           | Shorn wrote:
           | You could also consider one or more of the scientific data
           | repositories like Zenodo, FigShare, DataDryad, etc. 5GB is
           | small potatoes for those folks and they have serious data
           | retention policies. As a bonus, they'll also allocate you a
           | citable DOI.
        
       | freefaler wrote:
       | Great work and congratulations on your tenacity dealing with
       | bureaucrats. Open access and machine readable formats should be
       | widely available.
        
       | danielmarkbruce wrote:
       | Insanely great. Amazing work.
        
       | nextworddev wrote:
       | Is there a U.S. equivalent?
        
         | Something1234 wrote:
         | Govinfo.gov and the House provide the entire US law corpus, but
         | it's weird
        
         | showerst wrote:
         | The Feds have good sources for all the various admin
         | code/statutes/slip laws/etc, but there's not a great unified
         | source for case law.
         | 
         | There's nothing at the state level right now. I've been
         | considering setting up a statute scraper under the openstates
         | umbrella but it's a bit of a daunting project to start. Lots of
         | yeoman's work parsing gnarly websites or evading Lexis scraper
         | protections.
        
         | yieldcrv wrote:
         | no, there's a lot that we need before we can even begin to
         | improve this
         | 
         | even with a database of current laws as they exist right now,
         | the laws to change them primarily come in 2 forms:
         | 
         | 1. verbatim additional laws
         | 
         | 2. instructions that are essentially diffs to the current law.
         | what words to change, strike out, sections to re-arrange and
         | modify, as well as new lines of code. these have to be spliced
         | in to the prior state of the law
         | 
         | and after we have all that, laws are often following different
         | logic. like logic gates. One set of laws may be using "and" as
         | a set of conditions that must be satisfied all as one, but it
         | also could be using "and" as an "exclusive or", a set of
         | conditions where only one has to be satisfied. but when writing
         | it, those things all flowed grammatically and harmonization of
         | laws wasn't prioritized.
         | 
         | there's a whole lot that can be improved that we don't have the
         | infrastructure to do just yet. someone could do it, but that's
         | the first step.
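The second form of amendment described above (instructions that act as diffs against the current text) can be sketched in a few lines. This is only an illustration under stated assumptions: the `Amendment` structure, the three action names, and the sample section are all invented here, and real amending legislation is far less regular than exact string matches.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Amendment:
    """One hypothetical amending instruction against a single section."""
    section: str        # key of the section being amended, e.g. "s 5"
    action: str         # "substitute", "omit" or "insert_after" (invented names)
    target: str         # exact words in the current text the instruction refers to
    new_text: str = ""  # replacement or inserted words (unused for "omit")


def apply_amendment(law: dict[str, str], amendment: Amendment) -> dict[str, str]:
    """Return a new copy of the law with one instruction spliced in."""
    text = law[amendment.section]
    # Refuse to guess when the instruction is ambiguous or no longer matches.
    if text.count(amendment.target) != 1:
        raise ValueError(f"target must occur exactly once: {amendment.target!r}")

    if amendment.action == "substitute":
        updated = text.replace(amendment.target, amendment.new_text)
    elif amendment.action == "omit":
        # Remove the words, then collapse the whitespace left behind.
        updated = " ".join(text.replace(amendment.target, "").split())
    elif amendment.action == "insert_after":
        updated = text.replace(
            amendment.target, f"{amendment.target} {amendment.new_text}"
        )
    else:
        raise ValueError(f"unknown action: {amendment.action!r}")

    return {**law, amendment.section: updated}


def consolidate(law: dict[str, str], amendments: list[Amendment]) -> dict[str, str]:
    """Apply instructions in order; each one sees the result of the previous."""
    for amendment in amendments:
        law = apply_amendment(law, amendment)
    return law


# Invented example text, not taken from any real statute.
original = {"s 5": "A person must not knowingly drive a vehicle faster than 60 km/h."}
amendments = [
    Amendment("s 5", "substitute", "60 km/h", "50 km/h"),
    Amendment("s 5", "omit", "knowingly"),
    Amendment("s 5", "insert_after", "a vehicle", "on a public road"),
]
consolidated = consolidate(original, amendments)
print(consolidated["s 5"])
```

Order matters in this sketch: each instruction is matched against the text produced by the previous one, which is why the prior state of the law has to be known before anything can be spliced in.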
        
         | bentley wrote:
         | PRO is a prominent player in the space.
         | 
         | https://public.resource.org/
         | 
         | Notably, it's thanks to them that in 2020 the Supreme Court
         | ruled Georgia's legal code, including annotations, is
         | uncopyrightable.
        
       | thomasfromcdnjs wrote:
       | Incredible effort.
       | 
       | These types of projects have the potential to influence a nation.
        
       | subhashp wrote:
       | Well done!
        
       | ulrischa wrote:
       | I think in the law related subjects there is a huge potential for
       | digitalisation. In Germany the law texts are online but the
       | paragraphs are not linked
        
       | cookie_monsta wrote:
       | This is cool and I'm a little surprised to see that Victoria is
       | the one dragging the chain here. Is DataVic just talk, or does
       | that not apply to law for some reason?
        
         | mrmincent wrote:
         | Yeah I was also disappointed to hear that. Afaik (I asked
         | someone who previously worked in an adjacent field), it sounds
         | like there's no central system for publishing the judgements,
         | it's all published by each individual courthouse in a way that
         | suits them, so lots of tedious individual scraping would be
         | involved I'd imagine.
        
       | darcys22 wrote:
       | So good! It's crazy how legal information is such a spread out
       | mess.
       | 
       | What's worse is that git is such a perfect solution for
       | legislation.
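To illustrate why version control maps well onto legislation, here is a minimal sketch that diffs two versions of a provision using Python's standard `difflib`. The provision text and the "compilation" labels are invented for the example; a real workflow would presumably commit each compilation to a repository and let git produce the same kind of output.

```python
import difflib


def diff_versions(old: str, new: str, old_label: str, new_label: str) -> str:
    """Return a unified diff between two versions of a provision."""
    diff = difflib.unified_diff(
        old.splitlines(),
        new.splitlines(),
        fromfile=old_label,
        tofile=new_label,
        lineterm="",  # input lines carry no newlines, so emit none either
    )
    return "\n".join(diff)


# Invented example text, not taken from any real statute.
old_version = "\n".join([
    "5 Speed limit",
    "(1) A person must not drive faster than 60 km/h.",
    "(2) The penalty is 10 penalty units.",
])
new_version = "\n".join([
    "5 Speed limit",
    "(1) A person must not drive faster than 50 km/h.",
    "(2) The penalty is 10 penalty units.",
])

diff_text = diff_versions(old_version, new_version, "compilation 1", "compilation 2")
print(diff_text)
```

Lines prefixed with `-` and `+` in the output mark the words an amending act removed and added, which is the information a reader of amended legislation usually has to reconstruct by hand.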
        
       | juliangamble wrote:
       | Australia has had free, searchable collections of Australian Law
       | for 25+ years. Austlii is a prime example. There are Federal and
       | State collections as well. The author was conscientious enough to
       | read the scraping policy (or was blocked by anti-scraping tools),
       | which kept him from feeding one of these sites into his LLM.
        
         | ubutler wrote:
         | As I point out in the introduction, there are a few free-to-
         | access legal databases available in Australia, but none are
         | truly _open_ in the sense of being free from copyright
         | restrictions. Neither AustLII nor Jade are licensed under an
         | open source copyright licence such as CC BY 4.0 (which is what
         | the majority of my Corpus is under).
        
           | ENGNR wrote:
           | Love what you're doing! Being able to more easily bring LLMs
           | and other AI in will democratise the law quite a bit. Agreed
           | that even though Austlii exists, it needs to be under a
           | creative commons license, and it takes someone doing the
           | legwork of getting permission to get it there
        
       | Obscurity4340 wrote:
       | What do you think of the Canadian legal case law website CanLii?
       | What could it do better, or do you think it's done well?
       | 
       | Is it overdue for innovation?
        
       ___________________________________________________________________
       (page generated 2023-10-29 23:01 UTC)