The other shoe drops on genAI



Reality has hit the AI hype machine. On Alphabet’s recent earnings call, CEO Sundar Pichai touted widespread adoption of Google Cloud’s generative AI (genAI) solutions, but with a caveat—and a big one. “We are driving deeper progress on unlocking value, which I’m very bullish will happen. But these things take time.” The TL;DR? There’s a lot of genAI tire-kicking, and not much adoption for serious applications that generate revenue.

This is probably for the best because it gives us time to figure out what the heck we mean by “open source AI.” This matters, because we’re told by Meta CEO Mark Zuckerberg and others that open source will dominate large language models (LLMs) and AI, generally. Maybe. But while the OSI and others are trying to committee their way to an updated Open Source Definition (OSD), powerful participants like Meta are releasing industry-defining models, calling them “open source,” and not remotely caring when some vocally chastise them for affixing a label that doesn’t seem to fit the OSD. In fact, basically none of today’s models are “open source” in the way we’ve traditionally considered the term.

Does it matter? Some will insist that not only does it absolutely matter, it’s The Most Important Thing. If so, we’re nowhere near a solution. As summarized by OSI executive director Stefano Maffulli, “dabbling with an AI model could require access to the trained model, its training data, the code used to preprocess this data, the code governing the training process, the underlying architecture of the model, or a host of other, more subtle details.” This isn’t a mere matter of having access to code. The heart of the problem is data.

You keep using that word

“If the data aren’t open, then neither is the system,” argues Julia Ferraioli, a participant in the OSI’s committee to define open source for AI. This is true, she continues elsewhere, because an AI model is not open in any useful way if you don’t have the data used to train it. In AI, there’s no such thing as code without the data that animates it and gives it purpose.

Parenthetical note: I do find it a bit ironic that a host of AWS employees, including Ferraioli, make this argument, because it’s similar to what I and others have said about the cloud. What does software mean without the hardware configurations that give it life? Some, particularly employees of the big clouds, believe that such software can’t truly be open if it makes it hard for clouds to run the software without open sourcing their associated infrastructure. OK. But how is that wildly different from them demanding others’ data so they can run those models for their customers? I don’t think the cloud employees are operating in bad faith. I just think they’ve been insufficiently introspective on the issue. This is why I’ve made the case that to fix deficiencies in open source AI, we need to revisit similar deficiencies in open source cloud.

Meanwhile, the companies with lots of data have absolutely no incentive to bend on the issue (just as the cloud companies have little incentive to capitulate on copyleft issues), largely because it’s not at all clear that developers care. One industry open source executive, who asked to remain anonymous, suggests that developers aren’t interested in the open source positioning. According to him, “AI devs don’t care and don’t want the lecture” from the OSI or others on what open means. Zuckerberg certainly fits that description. Without a trace of irony, he went on a long diatribe about the value of open source: “The path for Llama to become the industry standard is by being consistently competitive, efficient, and open, generation after generation.”

Except Llama is not open. At least, not according to Maffulli and others of the OSI persuasion. Again, does it matter? After all, many developers are happily using Meta’s Llama 2, unconcerned that it doesn’t meet a stringent definition of open source. It’s open enough, apparently.

Good enough? Open enough?

Even among well-meaning and well-informed open source folks, there’s no consensus on what must be open in AI to qualify as “open source.” Jim Jagielski, for example, dismisses the idea that data is essential to open source AI. Even if we like the idea of opening up training data, doing so could open up all sorts of privacy and distribution complications.

The OSI expects to have a draft of its definition of open source for AI by October. Given that it’s almost August and key participants like Ferraioli note that important components of the Open Source AI Definition (OSAID) are “woefully misguided,” “ambiguous,” and have “fallen quite short of the mark,” it’s doubtful that the industry will have much clarity by October. Meanwhile, Meta and others (and basically no one is as open as the OSI would like) will continue to release open models and usually will call them “open source.” They’ll do so because some, like European regulators, want to see the cozy term “open source” slapped on the software and AI they embrace.

Again, will it matter? Does muddying what open source means bring the industry to a halt? Doubtful. Developers are already voting with their keyboards, using Llama 2 and other “open-enough” models. For the OSI to get in front of this momentum, it’s going to have to take a principled yet pragmatic approach to open source and stop following the dogmatic dictates of its most vociferous fans. It didn’t do this for cloud, which is why we have so much unsettled legal ground to cover for AI.
