Open and shut: IP tactics in the race for AI supremacy

By Ben Maling, Managing Associate, EIP

Ben Maling discusses the contrasting approaches to intellectual property taken by two AI companies

The artificial intelligence (AI) arms race is well and truly underway, framed in the media as a breathless clash between the titans of Big Tech, with stakes no less than the future of humanity itself. Viewed through the lens of a patent attorney working in AI, it is also the story of two fundamentally different approaches to controlling and leveraging intellectual property (IP).

For years, tech companies like Meta and Google have been developing and deploying AI extensively within their products. While they protect business-critical innovation with trade secrets and patents, they also publish much of their research and contribute to open source software projects.

Notable developments

In 2017, researchers at Google published a seemingly innocuous paper, ‘Attention is all you need’, introducing a new type of AI model – the transformer – for processing and generating sequences of ‘tokens’ such as words. Researchers were quick to appreciate the potential of the transformer, but one company took it and ran with it more than any other. That company was OpenAI, at the time a non-profit research organisation that promised to ‘freely collaborate’ with other institutions and researchers by making its patents and research open to the public. And so it did, releasing early iterations of its ‘generative pre-trained transformers’ (GPTs) under open source licences. GPT-1 and GPT-2 served as proofs of concept that piqued the interest of the industry, and the various tech giants quietly went about building their own large language models (LLMs) based on the same technology.
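For readers curious about the mechanics, the transformer's central operation – scaled dot-product attention, as described in the 2017 paper – can be sketched in a few lines. This is a minimal, illustrative NumPy version, not code from any of the models discussed here:

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention from 'Attention is all you need'.
    Q, K, V are (seq_len, d) arrays of query, key and value vectors."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)  # similarity of each query to each key
    # Softmax over keys: each row becomes a set of attention weights summing to 1
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V  # each output is a weighted mix of the value vectors

# Self-attention over three tokens, each represented by a 4-dimensional vector
x = np.random.rand(3, 4)
out = attention(x, x, x)
print(out.shape)  # (3, 4): one updated vector per token
```

Every token's output vector is a blend of all the others, weighted by learned relevance – the property that lets transformers model long-range relationships in text.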

In 2019, something changed – OpenAI shifted from a non-profit to a ‘capped profit’ business, and from then on its most powerful models would only be accessible via a user interface (UI) or application programming interface (API), enabling OpenAI to keep the details of its future models confidential. OpenAI was no longer open AI. In November 2022, the newly closed OpenAI released ChatGPT, and the world woke up to how far the technology had progressed. OpenAI emerged as the leader of the pack of LLM providers, and countless users and businesses have plugged into its APIs to make use of its biggest and best LLM, GPT-4.

The challenges

OpenAI’s head start is protected by high barriers to entry – tens of millions of dollars were needed to train GPT-4 – along with well-guarded secrecy around its training processes and the details of its models. But its lead is precarious. One problem it faces is the flow of talent between AI companies at all levels, and the information that goes with it, exemplified by OpenAI’s recent boardroom drama that led to CEO Sam Altman’s near ousting and subsequent job offer from Microsoft. While trade secrets can provide a legal remedy in the event of leaked information, along with copyright in the event of unauthorised code migration, such leakages are hard to detect, particularly where the recipient is also operating a closed model. Patents are ill-suited to protecting AI models themselves, though the vast majority of the main players are filing patent applications on the more enduring aspects of their AI systems in the hope of establishing defensive or offensive patent positions in future. In summary, while the closed model of OpenAI provides a technical barrier to competitors, it lacks legal reinforcement, and is not a reliable moat for keeping competitors at bay, as is illustrated by the next chapter of the story.

The Google memo leak

In May 2023, a memo was leaked from an anonymous engineer at Google, part of the team working on its flagship LLM project, Bard. The document laid out the truth of Google and OpenAI’s IP predicament:

‘The uncomfortable truth is, we aren’t positioned to win this arms race and neither is OpenAI. While we’ve been squabbling, a third faction has been quietly eating our lunch. I’m talking, of course, about open source. Plainly put, they are lapping us.’

The document cited open source projects having solved problems considered by Google to be ‘major and open’, including running LLMs on laptops and phones. The takeaways were that, in the long run, smaller and more nimble models that are quickly adaptable may end up being more capable than giant, slow-moving oil tankers like GPT-4. Furthermore, given the right algorithm, training with smaller, highly curated datasets may outperform scrape-half-of-the-internet approaches favoured by the big players.

The context in which the Google memo was leaked was that just three months earlier, Meta had released source code for training its new language model, Llama, under an open source licence. This would enable people to train their own LLMs based on the Llama framework, but Meta didn’t go as far as releasing the model weights – the trained parameters resulting from the multi-million-dollar data scraping and training processes. Without the weights, Llama itself couldn’t be straightforwardly replicated. So, for a short time, Llama was effectively closed, with certain research institutions being given access under a non-commercial licence. A week later, the weights were leaked via the infamous website 4chan, and the open source community finally got its hands on a highly capable LLM on which it could iterate and build. This was the spark that ignited the explosion of innovation referred to in the Google memo.

Meta’s approach

In effect, Meta lost control of the core IP in its flagship LLM. But Meta’s next move was a bold one. Under the stewardship of Yann LeCun, one of the ‘Godfathers of Deep Learning’ and a staunch open source advocate, Meta decided to fully embrace AI openness, and has gone on to become one of its most vocal proponents, along with several others such as Hugging Face and Stability AI. Meta made a great deal of noise about its open release of Llama 2, including the model weights. Meanwhile, increasing numbers of LLMs, as well as other foundation models such as image generators, have been released with various levels of openness, providing users and developers with a host of alternatives to the closed models of OpenAI and Google. But is all as it seems?

The idea of open source software is underpinned by the premise that computer code is subject to copyright. Under the open source model, software is published in source code form along with a licence that permits any party to freely use, modify and redistribute the code, provided the conditions of the licence are satisfied. In this context, ‘free’ means lacking restrictions (as in ‘free speech’), though it is usually also without charge (as in ‘free beer’). In the traditional setting, open sourcing the software in this way provides the necessary conditions for developers to tweak, recreate and improve upon it. Projects, businesses and ecosystems can be built around this sharing model, providing a host of benefits to participants, such as amortising the R&D costs of non-differentiating software components across different parties without the need for contractual relationships between them. By open sourcing selected code, companies can also encourage uptake of their solutions within the community, which may ultimately drive customers towards revenue-generating parts of the business.

This is the broad logic behind Meta’s move. By releasing its Llama models to the community, it encouraged vast numbers of developers to build on top of them, and since they are built in the open on Meta’s own frameworks (as opposed to Google’s or OpenAI’s), Meta can seamlessly incorporate the best ideas into its own codebase. By providing the best open-access model, it gained an army of developers for free (as in ‘beer’!). The open-access model has significant upsides for users too: aside from greater freedom to customise their models, rights under an open licensing regime can never be revoked, providing a level of dependability, while companies reliant on OpenAI’s APIs are exposed to a single point of failure, a point drawn into sharp focus by the company’s recent internal dramas.

So far, so good for open models, but there is also some sleight of hand at play. Open source software licences are supposed to permit any person to use the software for any purpose, and while Meta’s Llama 2 licence permits commercial use, the terms reveal a number of restrictions – most significantly, a bar on organisations with more than 700 million monthly active users and a prohibition on using Llama 2 to train other LLMs. This creative licensing has allowed Meta to have it both ways: take a big chunk of OpenAI’s lunch without handing over the keys to its Big Tech rivals. All the while, Meta can trumpet its commitment to openness and ethics – areas in which it hasn’t always scored well in the past.

Even when the licence itself is a true open source licence, applying the open source moniker to AI systems is problematic. Unlike other types of software, the source code and model weights do not provide a user with all the necessary means to study and reproduce the AI. For that, access to the training data is also necessary (along with the requisite computing resources). Details of the training data are notably absent from Meta’s Llama releases, as well as other models it has ‘open sourced’, meaning that anything built on Llama is likely to be the true offspring of Llama, ready to be led straight back into Meta’s paddock. This omission has drawn criticism from parts of the open source community, and work is being done to extend the open source definition to capture its true spirit in the context of AI.

The legal landscape

Everything discussed so far has taken place in something of a legal Wild West, while in future there will be an increasing burden to comply with nascent regulatory regimes. In this regard, a study by Stanford University earlier in the year evaluated providers of various foundation models (AI models trained on large volumes of data to carry out diverse ranges of tasks) in terms of their compliance with the draft EU AI Act. None of them fared particularly well, with BigScience/Hugging Face’s open-access large language model BLOOM being closest to compliant. This may be seen as a further advantage of open models, as the open paradigm is naturally aligned with the principle of transparency, which underpins many of the requirements of the Act. On the other hand, some argue that the open sharing of powerful foundation models is irresponsible. To quote Mustafa Suleyman, CEO of Inflection AI and co-founder of DeepMind, speaking at the UK’s AI Safety Summit in November, “On one hand, [smaller, open access models] enable open innovation, academic experimentation, small start-ups to get ahead, all things that we should encourage and embrace [...]. And at the same time, they also give a garage tinkerer the capability to have a one-to-many impact in the world, potentially, unlike anything we’ve ever seen.” It is impossible to withdraw an open access model with dangerous capabilities once released.

Across the pond, Sam Altman of OpenAI spoke to US Congress about concerns over the technology his company helped to create, and suggested a government licensing regime for the development and release of LLMs above a certain capability. Clem Delangue, CEO of Hugging Face, wasn’t impressed. In his view, “it would further concentrate power in the hands of a few and drastically slow down progress, fairness and transparency.” Hugging Face instead advocates for open development and sharing of AI models using Responsible AI Licenses (RAIL or OpenRAIL), which include field-of-use restrictions to enable the creator of an LLM to sue for copyright infringement in the case of unethical or unsafe downstream use. While OpenRAIL and RAIL licensing can go some way to promoting ethical and safe use of LLMs under the open model, they are no substitute for regulation, and most would agree that IP licences alone are not a sufficient deterrent to misuse.

The story is of two very different approaches to IP, exemplified by two very different AI companies. Of course, there are countless other examples, and the ones discussed here are chosen for their notoriety and their instructiveness. A whole spectrum of IP strategies can be employed around AI systems, and these are likely to be adapted as IP regimes evolve to meet the requirements of AI proliferation.

Ben Maling is managing associate at EIP
eip.com