[HN Gopher] Show HN: how I built the largest open database of Au...
___________________________________________________________________
Show HN: how I built the largest open database of Australian law
Author : ubutler
Score : 99 points
Date : 2023-10-29 12:06 UTC (10 hours ago)
(HTM) web link (umarbutler.com)
(TXT) w3m dump (umarbutler.com)
| ubutler wrote:
| Hey HN, Over the past year, I've been working on building the
| Open Australian Legal Corpus, the largest open database of
| Australian law. I started this project when I realised there were
| no open databases of Australian law I could use to train an LLM
| on.
|
| In this article, I run through the entire process of how I built
| my database, from months-long negotiations with governments to
| reverse engineering ancient web technologies to hacking together
| a multitude of different solutions for extracting text from
| documents.
|
| My hope is that the next time someone like me is interested in
| training an LLM to solve legal problems, they won't have to go
| down a year-long journey of trying to find the right data!
|
| You can find my database on HuggingFace
| (https://huggingface.co/datasets/umarbutler/open-australian-l...)
| and the code used to create it on GitHub
| (https://github.com/umarbutler/open-australian-legal-
| corpus-c...).
| nextworddev wrote:
| Awesome work!
| benn0 wrote:
| Fantastic work, and really appreciate the write up. It's quite
| timely for me - I'm from a tech background and have just
| started studying Australian law, and was thinking about doing
| exactly this - so you are years ahead of me :).
|
| Just one note - the link in your GitHub README to
| https://umarbutler.com/open-australian-legal-corpus doesn't
| seem to go anywhere.
|
| For someone interested in using the data (and helping out with
| bugs/issues), where would you suggest starting?
| DamonHD wrote:
| Would it be worth getting your corpus replicated into other
| venues as well, such as the Internet Archive or GitHub itself?
| ubutler wrote:
| GitHub might be difficult as they impose constraints on the
| size of repositories and the Corpus is around 5GB. The Internet
| Archive is a good idea, though. I'll have a look into that.
| I've also been thinking about sticking it on Kaggle to
| increase its reach.
| DamonHD wrote:
| There are also national and university data repositories that
| might be interested and for which 5GB is not even noticeable!
| Shorn wrote:
| You could also consider one or more of the scientific data
| repositories like Zenodo, FigShare, DataDryad, etc. 5GB is
| small potatoes for those folks and they have serious data
| retention policies. As a bonus, they'll also allocate you a
| citable DOI.
| freefaler wrote:
| Great work and congratulations on your tenacity dealing with
| bureaucrats. Open access and machine readable formats should be
| widely available.
| danielmarkbruce wrote:
| Insanely great. Amazing work.
| nextworddev wrote:
| Is there a U.S. equivalent?
| Something1234 wrote:
| Govinfo.gov and the House provide the entire US law corpus, but
| it's weird
| showerst wrote:
| The Feds have good sources for all the various admin
| code/statutes/slip laws/etc, but there's not a great unified
| source for case law.
|
| There's nothing at the state level right now. I've been
| considering setting up a statute scraper under the openstates
| umbrella but it's a bit of a daunting project to start. Lots of
| yeoman's work parsing gnarly websites or evading Lexis scraper
| protections.
| yieldcrv wrote:
| no, there's a lot that we need before we can even begin to
| improve this
|
| even with a database of current laws as they exist right now,
| the laws to change them primarily come in 2 forms:
|
| 1. verbatim additional laws
|
| 2. instructions that are essentially diffs to the current law.
| what words to change, strike out, sections to re-arrange and
| modify, as well as new lines of code. these have to be spliced
| in to the prior state of the law
|
| and after we have all that, laws often follow different
| logic. like logic gates. one set of laws may be using "and" as
| a set of conditions that must all be satisfied together, but
| another could be using "and" as an "exclusive or", a set of
| conditions where only one has to be satisfied. but when writing
| it, those things all flowed grammatically and harmonization of
| laws wasn't prioritized.
|
| there's a whole lot that can be improved that we don't have the
| infrastructure to do just yet. someone could do it, but that's
| the first step.
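
The "diff" style of amendment described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the `Amendment` record, its three actions (`substitute`, `insert_after`, `repeal`) and the toy "Example Act" are all hypothetical, and real amending instructions arrive as prose that would first have to be parsed into a structure like this.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Amendment:
    """One hypothetical amending instruction aimed at a single section."""
    section: str           # identifier of the section being amended
    action: str            # "substitute", "insert_after" or "repeal"
    old: str = ""          # words to strike out (substitute only)
    new: str = ""          # replacement words, or the text of a new section
    new_section: str = ""  # identifier of the new section (insert_after only)


def apply_amendments(law: dict[str, str], amendments: list[Amendment]) -> dict[str, str]:
    """Splice amendments into the prior state of the law, in order.

    Returns a new consolidated dict; the input dict is left untouched.
    """
    sections = list(law.items())  # ordered (identifier, text) pairs
    for amendment in amendments:
        index = next(
            (i for i, (ident, _) in enumerate(sections) if ident == amendment.section),
            None,
        )
        if index is None:
            raise KeyError(f"section {amendment.section} not found")
        ident, text = sections[index]
        if amendment.action == "substitute":
            if amendment.old not in text:
                raise ValueError(f"{amendment.old!r} not found in section {ident}")
            # Strike the old words and put the new ones in their place.
            sections[index] = (ident, text.replace(amendment.old, amendment.new, 1))
        elif amendment.action == "insert_after":
            sections.insert(index + 1, (amendment.new_section, amendment.new))
        elif amendment.action == "repeal":
            del sections[index]
        else:
            raise ValueError(f"unknown action {amendment.action!r}")
    return dict(sections)


# Toy example: an invented three-section Act and three amendments to it.
law = {
    "1": "This Act may be cited as the Example Act.",
    "2": "A person must not keep more than 2 dogs.",
    "3": "The penalty is 10 penalty units.",
}
amendments = [
    Amendment(section="2", action="substitute", old="2 dogs", new="3 dogs"),
    Amendment(section="2", action="insert_after", new_section="2A",
              new="Section 2 does not apply to working dogs."),
    Amendment(section="3", action="repeal"),
]
consolidated = apply_amendments(law, amendments)
for ident, text in consolidated.items():
    print(ident, text)
```

Because amendments are applied in order against the previous state, replaying the same list from the original text always yields the same consolidation, which is roughly the property a version-control system would give you.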
| bentley wrote:
| PRO is a prominent player in the space.
|
| https://public.resource.org/
|
| Notably, it's thanks to them that in 2020 the Supreme Court
| ruled Georgia's legal code, including annotations, is
| uncopyrightable.
| thomasfromcdnjs wrote:
| Incredible effort.
|
| These types of projects have the potential to influence a nation.
| subhashp wrote:
| Well done!
| ulrischa wrote:
| I think in law-related subjects there is huge potential for
| digitalisation. In Germany the law texts are online but the
| paragraphs are not linked
| cookie_monsta wrote:
| This is cool and I'm a little surprised to see that Victoria is
| the one dragging the chain here. Is DataVic just talk, or does
| that not apply to law for some reason?
| mrmincent wrote:
| Yeah I was also disappointed to hear that. Afaik (I asked
| someone who previously worked in an adjacent field), it sounds
| like there's no central system for publishing the judgements,
| it's all published by each individual courthouse in a way that
| suits them, so lots of tedious individual scraping would be
| involved I'd imagine.
| darcys22 wrote:
| So good! It's crazy how legal information is such a spread out
| mess.
|
| What's worse is that git is such a perfect solution for
| legislation.
| juliangamble wrote:
| Australia has had free, searchable collections of Australian law
| for 25+ years. AustLII is a prime example. There are Federal and
| State collections as well. The author was conscientious enough to
| read the scraping policy (or was blocked by anti-scraping tools),
| so he did not feed one of these sites into his LLM.
| ubutler wrote:
| As I point out in the introduction, there are a few free-to-
| access legal databases available in Australia, but none are
| truly _open_ in the sense of being free from copyright
| restrictions. Neither AustLII nor Jade are licensed under an
| open source copyright licence such as CC BY 4.0 (which is what
| the majority of my Corpus is under).
| ENGNR wrote:
| Love what you're doing! Being able to more easily bring LLMs
| and other AI in will democratise the law quite a bit. Agreed
| that even though Austlii exists, it needs to be under a
| creative commons license, and it takes someone doing the
| legwork of getting permission to get it there
| Obscurity4340 wrote:
| What do you think of the Canadian case law website CanLII?
| What could it do better, or do you think it's done well?
|
| Is it overdue for innovation?
___________________________________________________________________
(page generated 2023-10-29 23:01 UTC)