We finally have an “official” definition for open source AI

There is finally an “official” definition of open source AI.

The Open Source Initiative (OSI), a long-running institution that aims to define and “steward” all things open source, today released version 1.0 of its Open Source AI Definition (OSAID). The product of several years of collaboration with academia and industry, OSAID is intended to provide a standard by which anyone can determine whether an AI is open source or not.

You might be wondering, as this reporter was, why consensus matters for a definition of open source AI. Well, a big motivation is getting policymakers and AI developers on the same page, said Stefano Maffulli, executive director of OSI.

“Regulators are already watching the space,” Maffulli told TechCrunch, noting that bodies like the European Commission have sought to give special recognition to open source. “We made an explicit outreach to a diverse set of stakeholders and communities – not just the usual tech suspects. We even tried to reach out to the organizations that speak most often to regulators to get their early feedback.”

Open AI

To be considered open source under OSAID, an AI model must provide enough information about its design that a person could “substantially” recreate it. The model must also disclose any pertinent details about its training data, including its provenance, how the data was processed, and how it may be obtained or licensed.

“An open source AI is an AI model that allows you to fully understand how it was built,” Maffulli said. “This means you have access to all components, such as the full code used for training and data filtering.”

OSAID also sets out the usage rights that developers should expect with open source AI, such as the freedom to use the model for any purpose and modify it without having to ask anyone’s permission. “Most importantly, you should be able to build on top,” Maffulli added.

OSI has no enforcement mechanisms to speak of. It cannot pressure developers to comply with or adopt OSAID. But it does intend to flag models described as “open source” that don’t meet the definition.

“Our hope is that when someone tries to abuse the term, the AI community will say, ‘We don’t recognize this as open source,’ and it gets corrected,” Maffulli said. Historically, this approach has had mixed results, but it is not entirely without effect.

Many startups and big tech companies, most notably Meta, have used the term “open source” to describe their AI model release strategies, but few meet OSAID’s criteria. For example, Meta requires platforms with more than 700 million monthly active users to request a special license to use its Llama models.

Maffulli has been openly critical of Meta’s decision to call its models “open source.” After discussions with OSI, Google and Microsoft agreed to stop using the term for models that are not fully open, but Meta has not, he said.

Stability AI, which has long touted its models as “open,” requires businesses making more than $1 million in revenue to obtain an enterprise license. And French AI upstart Mistral’s license bars the use of certain models and outputs for commercial ventures.

A study last August by researchers at the Signal Foundation, the nonprofit AI Now Institute, and Carnegie Mellon found that many “open source” models are basically open source in name only. The data needed to train the models is kept secret, the compute needed to run them is beyond the reach of many developers, and the techniques to fine-tune them are intimidatingly complex.

Rather than democratizing AI, these “open source” projects tend to entrench and expand centralized power, the study’s authors concluded. Indeed, Meta’s Llama models have been downloaded hundreds of millions of times, and Stability claims that its models power up to 80% of all AI-generated images.

Divergent opinions

Meta, unsurprisingly, disagrees with this assessment, and it takes issue with OSAID as written (despite having participated in the drafting process). A spokesperson defended the company’s license for Llama, arguing that its terms, along with an accompanying acceptable use policy, act as guardrails against harmful deployments.

Meta also said it is taking a “cautious approach” to sharing model details, including details about training data, as regulations such as California’s training transparency act evolve.

“We agree with our partner OSI on many things, but we, like others in the industry, disagree with their new definition,” the spokesperson said. “There is no single open source AI definition, and defining it is challenging because previous open source definitions do not capture the complexities of today’s rapidly advancing AI models. We make Llama free and openly available, and our license and acceptable use policy help keep people safe by enforcing some restrictions. We will continue to work with OSI and other industry groups to make AI more accessible and free responsibly, regardless of technical definitions.”

The spokesperson pointed to other efforts to codify “open source” AI, such as the Linux Foundation’s suggested definitions, the Free Software Foundation’s criteria for “free machine learning applications,” and proposals from other AI researchers.

Meta, rather incongruously, is one of the companies funding OSI’s work – along with tech giants such as Amazon, Google, Microsoft, Cisco, Intel and Salesforce. (OSI recently won a grant from the nonprofit Sloan Foundation to reduce its reliance on tech industry backers.)

Meta’s reluctance to disclose training data likely has to do with the way its models – and most AI models – are developed.

Artificial intelligence companies collect large amounts of images, audio, video, and more from social networks and websites, and they train their models on this “publicly available data,” as it is commonly called. In today’s market, a company’s methods of assembling and refining datasets are considered a competitive advantage, and companies cite this as one of the main reasons for their nondisclosure.

But training data details can also paint a legal target on developers’ backs. Authors and publishers claim that Meta used copyrighted books for training. Artists have filed suits against Stability for scraping their work and reproducing it without credit, an act they liken to theft.

It’s not hard to see how OSAID could be problematic for companies trying to settle lawsuits favorably, especially if plaintiffs and judges find the definition compelling enough to use in court.

Open questions

Some argue that the definition does not go far enough, for instance in how it deals with the licensing of proprietary training data. Luca Antiga, CTO of Lightning AI, points out that a model can meet all of OSAID’s requirements even though the data used to train it is not freely available. Is it “open” if you have to pay thousands of dollars to inspect the private stores of images that a model’s creators paid to license?

“To be of practical value, particularly for businesses, any definition of open source AI must provide reasonable confidence that whatever is licensed can be licensed for the way an organization is using it,” Antiga told TechCrunch. “By neglecting to address training data licensing, OSI leaves a loophole that will make the terms less effective in determining whether OSI-licensed AI models can be adopted in real-world situations.”

In version 1.0 of OSAID, OSI also does not address copyright as it pertains to AI models, or whether granting a copyright license would be sufficient to ensure a model satisfies the open source definition. It is not yet clear whether models – or components of models – can be copyrighted under current IP law. But if the courts decide they can be, OSI suggests new “legal tools” may be needed to properly open-source IP-protected models.

Maffulli agreed that the definition will need updates, perhaps sooner rather than later. To that end, OSI has established a committee responsible for monitoring how OSAID is applied and proposing amendments for future versions.

“This is not the work of some lone genius in a basement,” he said. “It’s open work with broad stakeholders and different interest groups.”