Love.Law.Robots.

Three Things: New Contracts Dataset

There’s a new contract dataset in town, and it’s called the Contract Understanding Atticus Dataset (CUAD). Unlike LegalBert (which is a totally different thing), this dataset is expertly annotated, which means legally trained people annotated the contracts. They labelled the contracts by clause type (governing law clauses, warranty clauses and so on). There’s a scientific paper ("CUAD: An Expert-Annotated NLP Dataset for Legal Contract Review") and code to go along with it.

Thing 1: Good Data is hard to come by

One might think that with all the new and fancy models coming out every few months, natural language processing (NLP) is ready to solve all the world’s problems. At least in the legal world, even what is possible is made really difficult by the lack of useful data. I would sum up the problems like this:

  • There aren’t enough engineers and technically trained people to understand what legal users need.
  • There aren’t enough lawyers who have the technical and engineering knowledge to tell others what they need.
  • Whatever is out there isn’t designed to be used by computers or lawyers, or is kept behind an exclusive paywall.

As such, I am definitely excited when there’s something I can use.

CUAD’s paper already mentions the prohibitive costs and efforts of creating such a dataset:

… a conservative estimate of the pecuniary value of CUAD is over $2 million (each of the 9283 pages were reviewed at least 4 times, each page requiring 5-10 minutes, assuming a rate of $500 per hour).

I wouldn’t use an hourly rate of $500 for law students. Still, I agree that this dataset is unique and definitely valuable, and I am very grateful that it is freely available!
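The arithmetic behind the paper's estimate can be replayed to see how sensitive it is to the hourly rate. This is only a minimal sketch: the page count, number of reviews, minutes per page and rate are taken from the quoted passage, while the function name `review_cost` is hypothetical.

```python
def review_cost(pages, reviews_per_page, minutes_per_review, hourly_rate):
    """Return the dollar cost of reviewing every page the given number of times."""
    hours = pages * reviews_per_page * minutes_per_review / 60
    return hours * hourly_rate


# Figures quoted from the CUAD paper: 9283 pages, at least 4 reviews each,
# 5 to 10 minutes per page, $500 per hour.
low = review_cost(9283, 4, 5, 500)
high = review_cost(9283, 4, 10, 500)
print(f"${low:,.0f} to ${high:,.0f}")
```

With the quoted numbers the range runs from roughly $1.5 million to $3.1 million, so "over $2 million" sits in the middle of it. Because the cost scales linearly with the rate, halving the hourly rate halves the estimate.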

Thing 2: CUAD has its limitations

If you are familiar with datasets, you will find that CUAD is actually pretty tiny. At 510 contracts, it features only a small sliver of the contract world. Furthermore, the contracts are culled from the SEC EDGAR database, which means that they feature the drafting practices of large American corporate firms. EDGAR contracts aren’t considered to be the paragon of contract drafting excellence either. It’s a tiny window into a very specialised world.

What about the job it was tasked to do (categorising contract clauses)? The paper concludes that your mileage might vary. The models do very well (surprise, surprise) on governing law clauses but poorly on things like the right of first refusal.

Source: CUAD Paper


I was not surprised that the results fell within such a broad range. Governing law clauses don’t feature many choices (Singapore or English law?). They are of recent vintage in international transactions, which means that they do not have as many idiosyncratic variations as other clauses. Compare this with right of first offer clauses, where the drafter must decide when the right is activated, what its consequences are and which exceptions apply, and which are very heavily negotiated.

CUAD’s paper also suggests that the dataset’s size matters, which the authors demonstrated by training on parts of their dataset. Huge improvements can be seen between 100 and 1,000 annotations, and steady improvements between 1,000 and 10,000 annotations. Intuitively, I agree with this assessment and with their conclusion that “data is a bottleneck”.

I would be interested in finding out whether what a model learns from data such as CUAD can be transferred to more specific, regional domains. For example, if I managed to annotate, say, 100 contracts governed by Singapore law, could I use CUAD as a foundation for my model to learn better on my data? The answer isn’t straightforward. If the model had learned American drafters' idiosyncrasies, it might have to “unlearn” them when it encountered more local contracts.

Thing 3: Living and Dying on the cutting edge

I ain’t a fussy guy. Even though the paper warns that a lawyer would probably encounter two irrelevant clauses for every relevant clause using the model, I still think it’s useful. I don’t agree that contract review is like finding a needle in a haystack. However, if the model can narrow down my search or highlight things I miss in a manual review, it will help.
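That warning can be restated as a precision figure. This is a minimal sketch, assuming "two irrelevant clauses for every relevant clause" means two false positives for each true positive; the function name `precision` and its parameter names are hypothetical.

```python
def precision(relevant_flagged: int, irrelevant_flagged: int) -> float:
    """Share of flagged clauses that are actually relevant."""
    total = relevant_flagged + irrelevant_flagged
    if total == 0:
        raise ValueError("no clauses were flagged")
    return relevant_flagged / total


# One relevant clause for every two irrelevant ones, as in the paper's warning.
p = precision(relevant_flagged=1, irrelevant_flagged=2)
print(f"precision = {p:.0%}")
```

In other words, roughly one in three flagged clauses would be worth a lawyer's attention.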

I reached the limit of my experiments with this dataset when I loaded the DeBERTa-v2 checkpoint and tokenised my input using Hugging Face. All I got from running my test inputs was a bunch of vectors. 😭
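For what it's worth, those vectors may be decodable. The sketch below assumes they were the start and end logits of an extractive question-answering head (one score per input token), which is how Hugging Face question-answering models report their output and how the CUAD paper frames the task. The function names, the toy tokens and the toy scores are all made up for illustration; a real pipeline would also have to handle subword tokens, long contracts split into windows, and the "no answer" case.

```python
from typing import List, Tuple


def best_span(start_logits: List[float], end_logits: List[float],
              max_answer_len: int = 30) -> Tuple[int, int, float]:
    """Return (start, end, score) of the highest-scoring span with start <= end."""
    best = (0, 0, float("-inf"))
    for start, start_score in enumerate(start_logits):
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_score + end_logits[end]
            if score > best[2]:
                best = (start, end, score)
    return best


def decode_answer(tokens: List[str], start_logits: List[float],
                  end_logits: List[float], max_answer_len: int = 30) -> str:
    """Turn per-token start/end scores back into the text of the best span."""
    if not (len(tokens) == len(start_logits) == len(end_logits)):
        raise ValueError("tokens and logits must have the same length")
    start, end, _ = best_span(start_logits, end_logits, max_answer_len)
    return " ".join(tokens[start:end + 1])


# Toy input: whole words stand in for tokens, and the scores are invented.
tokens = ["This", "Agreement", "is", "governed", "by", "the",
          "laws", "of", "Singapore", "."]
start_logits = [0.1, 0.0, -1.0, 0.2, 0.0, 1.5, 4.0, 0.3, 2.0, -2.0]
end_logits = [-1.0, 0.0, -0.5, 0.1, 0.0, 0.2, 0.5, 0.1, 5.0, -1.0]

print(decode_answer(tokens, start_logits, end_logits))
```

The idea is simply to pick the pair of positions whose start score plus end score is highest, then read the tokens between them back out as text.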

Unlike many other projects that I am used to (or spoiled by), documentation on using the model is quite sparse. So I suppose I have to put this down for now.

At some point, when my NLP skills improve, or I find more examples of applying CUAD to learn from, I should pick this up again.

Conclusion

CUAD demonstrates how difficult specialised domains like law are for a newcomer. The data is expensive and time-consuming to create, and the tools are hard for a beginner to learn. Maybe at some point, we will find that tools are easier to deploy. Till then, you will have to live and die on the cutting edge.

I regularly write about tech issues that interest me. Read on for more insights and information:
  • Who wants to do an E-Will?

    COVID-19 brought home a few trends most would not bother with otherwise. With remote working instead of showing up at an office, and four budgets passed in as many months, it’s time to question deeply-held assumptions. One deeply-held assumption which we might not be able to shake off is the archaic and highly formal process of getting wills done. The Wills Act is of considerable vintage, dating earlier than the mid-nineteenth century.
  • What I wouldn’t build

    Natural language processing could be relevant for legal applications, but its basis remains in science and computers. Once again, due to the COVID-19 epidemic, more presentations are available online and free, and I picked this which was a keynote at a recent “Widening Natural Language Processing” conference. "Looks like the recording is publicly available as well: https://t.co/udMofyCZKU" (Rachael Tatman, @rctatman, July 5, 2020). I think the presentation is easy enough to follow without a detailed knowledge of NLP.
  • WestLaw takes a shot at ROSS

    A David and Goliath battle brews over the collection of data used for ROSS Legal's AI products.
  • Detox your accounts with better passwords now!

    The Problem that no one wants to talk about. Disgusting practices everywhere… One of the most intriguing changes in my time online regards passwords. Back in the Yahoo and Hotmail days, passwords were an annoyance, and you would use the easiest thing (birth dates, ID numbers, mother’s maiden name) to get rid of it. We then had to adjust our strategies when password policies became fashionable.
  • Get your daily dose of Sudoku with a little bit of Python

    The goal was to do this in one night. It was a very long night.