There is no open source AI.
The illusion of "open weights" and what people are calling "open source" AI.
In his 2025 keynote at Intel Vision, Jim Zemlin, the executive director of The Linux Foundation, an organization that shepherds the largest and most important free and open source software projects in the world, like the Linux kernel, PyTorch, and Kubernetes, said this:
On January 27th this year, Nvidia stock took a nosedive after Chinese AI lab DeepSeek's R1 model launched to incredible fanfare - here was this fully open source model, trained at a fraction of the cost and freely available for anyone to use.
Investors panicked, triggering a huge selloff.
In reality, though, Nvidia soon bounced back. The market quickly realized that open models like DeepSeek's expand the overall market for AI hardware - more models, in more hands, means demand shifts toward inference workloads rather than training alone.
The proposition here is a classic open source project playbook: more open software in more hands leads to greater adoption and deeper innovation through collaboration across all industries.
Unfortunately, there is no "open source" AI, and innovation will be stifled by a locked-down ecosystem that only a few AI labs can access.
The Northern Whale Fishery: The "Swan" and "Isabella", c. 1840 - National Gallery - Public Domain
Today, models that claim to be "open source" are a pale reflection of what free (as in freedom) and open source software really is: the freedom to study how a program works, see its source code, redistribute it under its license, change and modify it, fix problems, and rebuild it yourself.
What is openly available with DeepSeek's R1 model is not the source code, not the training runbooks, and not even the training data. No, just like so many of its predecessors (Meta's Llama models, Mistral's Mixtral models, and Microsoft's Phi models), DeepSeek simply released the network weights for R1.
What exactly are these weights? They are essentially the "learned" parameters of the neural network: a huge collection of floating point numbers, often stored at high precision. These parameters are the result of massive machine learning jobs run at scale across big clusters of GPUs, and they are the scaffolding used to generate output from the tokens of some input. This is exactly how tools like llama.cpp and Ollama work: they load the weights into an inference runtime and use them to generate output based on the user's input.
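To make that concrete, here is a deliberately tiny, hypothetical sketch in Python (the arrays, shapes, and names are made up and do not correspond to any real model): the "open" artifact in a weights release is, in essence, just bags of floating point numbers like these, which an inference runtime multiplies against your input tokens.

```python
import numpy as np

# Pretend these arrays were downloaded from a model hub: they are just
# tensors of learned floating point values. Nothing in them reveals the
# training data, the training code, or why any particular value is what it is.
vocab_size, hidden_dim = 16, 8
weights = {
    "embedding": np.random.randn(vocab_size, hidden_dim),  # token id -> vector
    "output":    np.random.randn(hidden_dim, vocab_size),  # vector -> logits
}

def next_token_logits(token_ids: list[int]) -> np.ndarray:
    """Toy 'inference': look up embeddings, pool them, project back to logits."""
    hidden_state = weights["embedding"][token_ids].mean(axis=0)
    return hidden_state @ weights["output"]

print(next_token_logits([1, 4, 2]).shape)  # (16,): one score per vocabulary token
```

A real inference runtime does the same kind of thing at vastly larger scale: it maps the released tensors into memory and runs them forward. What it cannot do is tell you where those numbers came from.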
In other words, distributing open weights is no different than distributing compiled binaries. And just like compiled binaries, open network weights are very difficult to modify, study, or fix.
Even worse, it’s nearly impossible to understand how a neural network operates by only inspecting its weights.
For this exact reason, machine learning algorithms and techniques are often called "black boxes": difficult to see into and arduous to understand from the outside. Exactly how a model derived its output is often a question that goes unasked. Instead, researchers apply the scientific method to these black boxes, probing them empirically and at scale to understand how they work. It is worth stating plainly: there are maybe a few hundred people spread across academia and industry who truly understand how these massive models work end to end. In other words, almost all machine learning researchers understand complex models and algorithms through the lens of their empirical data, not their true theoretical mechanisms.
And this makes sense! Many machine learning techniques, like deep learning, transformers, and neural networks, pass data through multi-dimensional, many-layered, non-linear functions that are intended to "learn" and accurately capture the nuance of messy, non-deterministic data and domains.
So, in short: because it's nearly impossible to understand how a "black box" Large Language Model works by only looking at its weights, distributing the weights alone does not make it "open source". Users do not have the freedom to study these systems; they do not have the freedom to change the way the models work; they do not have the freedom to fix bugs or submit patches; and they do not have the freedom to truly control the software they use that integrates with LLMs.
What about fine tuning? One could argue that the open source freedoms I'm describing could be fulfilled through fine tuning: the process of taking the weights of a pre-trained model and training them further for a specific task or use case. Unfortunately, this is not in the ethos of open source and only bolsters the argument that "open weights are not open source": fine tuning doesn't give you any deeper understanding of how a model works, what it was trained on, or the ability to fix an underlying issue with the system; it simply lets you apply more machine learning to push the weights in a direction you want. In my eyes, it's no different than modding a closed-source video game: you do not have the freedom to view the game's source code, and you often end up stumbling around, guessing at how certain systems work, in order to make the changes you want.
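As a loose illustration only (a toy linear model, not a real LLM and not any lab's actual procedure), fine tuning boils down to running a few more optimization steps over the opaque weights you were handed:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for released, pre-trained weights: opaque numbers of unknown origin.
pretrained = rng.standard_normal((8, 4))

# Our own small, task-specific dataset.
x = rng.standard_normal((32, 8))
y = rng.standard_normal((32, 4))

# "Fine tuning": nudge the weights with gradient descent on our data.
weights = pretrained.copy()
for _ in range(100):
    pred = x @ weights                    # forward pass
    grad = x.T @ (pred - y) / len(x)      # gradient of mean squared error
    weights -= 0.01 * grad                # small update step

# The result is still a black box: modified floats, with no source code,
# no training data, and no provenance for the behavior we started from.
```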
Unfortunately, even the shallow freedoms of fine tuning are out of reach for the everyday user: underpinning this entire discussion is the assumption that a user can run these kinds of machine learning jobs on the necessary hardware. To fine tune or train a fully un-quantized model (where the floating point values in the weights are kept at full, un-truncated precision), you'd need an entire rack of enterprise-grade GPUs. Even a cutting-edge consumer graphics card like Nvidia's RTX 5090 struggles to run a full 70 billion parameter, un-quantized open weight model.
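The back-of-envelope arithmetic makes the point (counting only the weights themselves, and ignoring activations, optimizer state, and the KV cache, all of which make training and fine tuning far more demanding):

```python
params = 70e9  # a 70 billion parameter model
bytes_per_param = {"fp32": 4, "fp16/bf16": 2, "int8": 1, "int4": 0.5}

for precision, nbytes in bytes_per_param.items():
    gigabytes = params * nbytes / 1e9
    print(f"{precision:>9}: ~{gigabytes:,.0f} GB just to hold the weights")

# fp32: ~280 GB, fp16/bf16: ~140 GB, int8: ~70 GB, int4: ~35 GB.
# A consumer card with roughly 32 GB of VRAM can't even hold the 4-bit
# quantized weights of a 70B model, let alone the un-quantized weights
# you'd need to train or fine tune it at full precision.
```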
If I can't study the source, if I can't modify it for my needs, if I can't submit patches, and if I can't really run the full model, it all raises the question: why release the weights in the first place? Sure, it's cool that I can run a quantized version of a model on my Mac without an internet connection or a third-party provider; all you need is to download it ahead of time and use something like Ollama to run it. But there are, unfortunately, much more nefarious reasons the biggest AI labs would want to publish their weights. After all, remember, this whole thing is a big race to the bottom: we'll either discover some new scaling problem that prevents the current transformer technology from getting any better, or we'll create god: a basilisk, some sort of artificial superintelligence that may or may not align with humanity's best interests.
The first AI lab to do this will effectively be dubbed "the winner" in the market. And releasing weights behind the mask of a generous "open source" strategy is a fantastic way to undercut, slow down, and distract your competition. Where open source has traditionally been a vehicle for cooperation, collaboration, and innovation, here it's being used under the guise of pure capitalistic, corporate greed.
In the book Free Software, Free Society, Richard Stallman reflects on what made the early days of computing and sharing software so difficult:
… the first step in using a computer was to promise not to help your neighbor. A cooperating community was forbidden. The rule made by the owners of proprietary software was:
If you share with your neighbor, you are a pirate.
If you want any changes, beg us to make them.
We find ourselves in a very similar place: we must beg OpenAI, Microsoft, Google, and Anthropic to build the features we want. If we have problems, we are at their mercy for a fix. And information on how these systems work - systems that are ever more deeply integrated into our lives - is scarce at best.
What would it take for LLMs and AI models to be truly “free” and “open source”?
AI labs would need to make it so that anyone could rebuild and reproduce the model. Release the source code! Release the training runbooks! Release the training data!
And herein, lying in wait, is the true dragon in the den: the training data. Many of the largest AI labs have scraped every last bit of content on the internet in order to train the largest and most powerful models. In many cases, we've effectively exhausted the content of the public internet and run out of things to train new models on. Instead, we look to existing models to distill new synthetic content for newer and bigger models.
Worse yet, much of the data used in training these models may have been illegally obtained: in an investigation by The Atlantic, researchers found that millions of pirated books and scientific papers were used by Meta to feed larger and larger Llama models. This is content Meta very likely does not hold the copyright to and cannot redistribute (even if it has been processed through an LLM or regurgitated as synthetic data). This, of course, is leading to a huge, drawn-out legal battle: one that may define how AI labs are allowed to train their models on copyrighted material in the future.
It'd be nearly impossible for an AI lab to make a truly open and free (as in freedom) LLM: doing so would require releasing the training data so that users could replicate and reproduce the software. And, because so much of that training data is copyrighted material, the labs find themselves unable to do so.
Instead, they've settled for a half measure: release the weights, just call it "open source", and wave off anyone who questions it. We have collectively been told a well-coordinated lie: people in positions of extreme power claim that trending open weight models are "fully open source", the Open Source Initiative has bent the knee with a neutered definition of "open source AI", and AI labs continue to take power away from users.
We are not free, we have been fooled.
Herein lies the great irony of LLMs: unless something fundamentally changes in society around licensing, copyright, and "the commons", a truly open source LLM can never exist. Because Large Language Models require the entire content of the internet to function coherently, releasing petabytes and petabytes of raw, unfiltered internet content would almost automatically invite some sort of copyright takedown action.
This in itself challenges the very concept of copyright: in the age of AI, does anyone have copyright over anything? Should we just continue to feed the ever growing AI basilisk? Who holds copyright on material generated by AI? Who really owns “the commons”? Do AI labs have the right to endlessly feed off the collective commons and public works, building bigger and bigger models that give them more and more power?
In a world that seems to be rocketing towards a future deeply integrated with AI, we must ask ourselves who we want to control the future of this technology. Today, it is not the community, it is not the technologists, and it certainly isn't the users. More and more, it seems the largest AI labs hold all the power while we are placated by the mirage of "open source" AI.
Open weights are not open source.
Later in Free Software, Free Society, Stallman quotes Hillel the Elder:
If I am not for myself, who will be for me? If I am only for myself, what am I? If not now, when?
Looking at an early computing industry locked into proprietary operating systems, and taking Hillel's advice, Stallman went on to create GNU. He fought to create a digital commons where knowledge could be freely shared. He enabled the entire technology industry and world economy to build on free software. He conceptualized the legal framework that makes all free and open source software possible today.
And since the dawn of the open source movement, there has been an epic struggle between corporate greed, innovation, and freedom: a three-way tug of war, forever suspended in balance, never tipping too far to one side without a correction. But we find ourselves at a critical juncture, where the scales of power are tipping too far, possibly never to be restored.
Proprietary AI technology is the perfect corporate weapon against the open source software movement: impossible to truly make free, impractical to run yourself, difficult to study, and yet, with artifacts that are "good enough", able to convince the wider technology world that it is indeed "open".
We must ask ourselves the same uncomfortable questions Stallman asked: in the age of AI, who is for me? Large, for-profit AI labs? Billionaire figureheads?
When will we be set free?
The age of AI is now. And if we are not free now, when?