reedciccio's comments

That means GrapheneOS is "eventually open source", a practice (known as Delayed Open Source publication) as old as open source (call it free software, if you prefer) itself. More at https://opensource.org/delayed-open-source-publication


Sure, but as an end user of GrapheneOS there's more and more code I cannot see but must trust. The closed driver modules are bad enough.


I'm extremely interested in this topic. Would you be able to share your presentation?


Any architect or urban planner is aware of the problems with flatlands near riverbeds. It's the politicians who decide to ignore the science, which has been known since the times of the Romans.


Yes. Government subsidized insurance bears much of the blame.


Is it Llama violating the "copyright" or is it the researcher pushing it to do so?


If you distribute a zip file of the book, are you violating copyright, or is it the person who unzips it?


If you walk through the N-gram database with a copy of Harry Potter in hand and observe that for N=7 you can find any piece of it in the database with above-average frequency, does that mean the N-gram database is violating copyright?
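
For concreteness, a minimal sketch of the thought experiment (the file names and the split-on-whitespace tokenization are invented for illustration):

    from collections import Counter

    def ngram_counts(text, n=7):
        # count every run of n consecutive words
        words = text.split()
        return Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))

    db = ngram_counts(open("corpus.txt").read())           # hypothetical corpus
    probe = ngram_counts(open("harry_potter.txt").read())  # the book in hand
    found = sum(1 for g in probe if g in db)
    print(found / max(len(probe), 1))  # fraction of the book's 7-grams in the db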


Not unless you can reproduce large portions of Harry Potter verbatim from the database. If the 7-grams are taken only from Harry Potter, that is very likely.


If the database is sharing those pieces, it might be, yes.

Copyright takes into account the use for which the copying is done. Commercial use will almost always be treated as not fair use, with limited exceptions.


I'd say no, because you can't reasonably access and order those pieces without already having the work at your side to use as a reference.


You are.

Copyright is quite literally about the right to control the creation and distribution of copies.

The creation of the unzipped file is not treated as a separate copy, so the recipient would not be violating copyright just by unzipping the file you provided.


It's a very imperfect analogy, though: these things can't be rebuilt "from scratch" like a program, and the training process doesn't seem to be replicable anyway. Nonetheless, full data disclosure is necessary, according to the results of the years-long consultation led by the Open Source Initiative: https://opensource.org/ai


> the training process doesn't seem to be replicable anyway

The training process is fully deterministic. It's just an algorithm. Feed the same data in and you'll get the same weights out.

If you're speaking about the computational cost, it used to be that way for compilers too. Give it 20 years and you'll be able to train one of today's models on your phone.
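
At toy scale, on a single machine, the claim is easy to demonstrate (a minimal sketch with plain NumPy gradient descent, not any production pipeline):

    import numpy as np

    def train(seed):
        rng = np.random.default_rng(seed)
        w = rng.normal(size=3)                  # "random" init, but seeded
        X, y = rng.normal(size=(10, 3)), rng.normal(size=10)
        for _ in range(100):
            w -= 0.01 * X.T @ (X @ w - y)       # plain gradient descent
        return w

    assert np.array_equal(train(0), train(0))   # same data + seed, same weights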


> The training process is fully deterministic. It's just an algorithm. Feed the same data in and you'll get the same weights out.

No, it is not. The training process is non-deterministic: given exactly the same data, the same code and the same seeds, you'll get different weights. Even the simplest operations like matrix multiplication will give you slightly different results depending on the hardware you're using (e.g. you'll get different results on CPU, on a GPU from vendor #1 and on a GPU from vendor #2, probably on different GPUs from the same vendor, and on different CUDA versions, etc.). You'll also get different results depending on the dimensions of the matrices (e.g. if you fuse the QKV weights of a modern transformer into a single matrix and do one multiplication instead of multiplying each separately, you'll get different results), and some algorithms (e.g. the backward pass of Flash Attention) are explicitly non-deterministic to be faster.
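
The fused-QKV case is easy to try yourself (a rough sketch; whether the difference is exactly zero depends on your BLAS backend and hardware):

    import torch

    torch.manual_seed(0)
    x = torch.randn(16, 64)
    wq, wk, wv = (torch.randn(64, 64) for _ in range(3))

    fused = x @ torch.cat([wq, wk, wv], dim=1)          # one big matmul
    split = torch.cat([x @ wq, x @ wk, x @ wv], dim=1)  # three small matmuls
    print((fused - split).abs().max())  # often a small nonzero value in float32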


> Even the simplest operations like matrix multiplication will give you slightly different results depending on the hardware you're using

That has everything to do with the implementation, and nothing to do with the algorithm. There is an important difference.

Math is deterministic. The way [random chip] implements floating point operations may not be.

Lots of scientific software has the ability to use IEEE-754 floats for speed or to flip a switch for arbitrary precision calculations. The calculation being performed remains the same.
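
The classic illustration, using Python's exact fractions as the "flip a switch" alternative:

    from fractions import Fraction

    print(0.1 + 0.2 == 0.3)  # False: IEEE-754 rounding
    print(Fraction(1, 10) + Fraction(2, 10) == Fraction(3, 10))  # True: exact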


> Math is deterministic.

The point is none of these models are trained with pure "math". It doesn't matter that you can describe a theoretical training process using a set of deterministic equations, because in practice it doesn't work that way. Your claim that "the training process is fully deterministic" is objectively wrong in this case, because none of the non-toy models use (nor can they practically use) such a deterministic process. There is a training process which is deterministic, but no one uses it (for good reasons).

If you had infinite budget, exactly the same code, the same training data, and even the same hardware, you would not be able to reproduce the weights of DeepSeek R1, because it wasn't trained using a deterministic process.


A lot of quibbling here, wasn't sure where to reply. If you've built any models in PyTorch, then you know. Conceptually it is deterministic: a model trained using deterministic implementations of the low-level algorithms will produce deterministic results. And when you are optimizing the pipeline, it is common to do just that:

    import random
    import numpy as np
    import torch

    # seed every RNG and force deterministic kernel implementations
    torch.manual_seed(0)
    random.seed(0)
    np.random.seed(0)
    torch.use_deterministic_algorithms(True)

But in practice that is too slow, so we use nondeterministic implementations that run fast and loose with memory management and don't necessarily care about the order in which parallel operations return.


I’m pretty sure the initial weights are randomized, meaning no two models will train the same way twice. The order in which you feed training data to the model would also add an element of randomness. Model training is closer to growing a plant than running a compiler.


That's still a deterministic algorithm. The random data and the order of feeding training data into it are part of the data which determines the output. Again, if you do it twice the same way, you'll get the same output.
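
In other words, the seed is just another input (trivial sketch):

    import random

    def init_weights(seed, n=5):
        rng = random.Random(seed)   # "random" values, fully driven by the seed
        return [rng.random() for _ in range(n)]

    assert init_weights(42) == init_weights(42)  # same inputs, same output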


If they saved the initial randomized model and released it, and there was no random bit-flipping during copying, then possibly, but it would still be difficult once you factor in the RLHF that comes from random humans interacting with the model to tweak its workings. If you preserved that data as well, and got all of the initial training correct... maybe. But I'd bet against it.


So long as the data provided was identical, and sources of error like floating point errors due to hardware implementation details are accounted for, I see no reason output wouldn't be identical.

Where would other non-determinism come from?

I'm open to there being another source. I'd just like to know what it would be. I haven't found one yet.
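
To be clear, the floating-point source I mean is easy to demonstrate: reassociating a sum, which parallel reductions do freely, changes the result.

    a = [1.0, 1e16, -1e16]
    print(sum(a))            # 0.0: the 1.0 is absorbed by 1e16
    print(sum(reversed(a)))  # 1.0: the big terms cancel first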


> if you do it twice the same way, you'll get the same output

Point at the science that says that, please: current scientific knowledge doesn't agree with you.


> Current scientific knowledge doesn't agree with you.

I'd love a citation. So far you haven't even suggested a possible source for this non-determinism you claim exists.


What makes models non-deterministic isn't the training algorithm, but the initial weights being random.

Training is reproducible only if, besides the pipeline and data, you also start from the same random weights.


That would fall under "Feed the same data in and you'll get the same weights out." Lots of deterministic algorithms use a random seed.


So is there no “introduce randomness” step afterwards? If not, I would guess these models would get stuck in a local maximum.


> If not, I would guess these models would get stuck in a local maximum

It sounds like you're referring to something like simulated annealing. Using that as an example, the fundamental requirement is to introduce arbitrary, uncorrelated steps -- there's no requirement that the steps be random, and the only potential advantage of using a random source is that it provides independence (lack of correlation) inherently; but in exchange, it makes testing and reproduction much harder. Basically every use of simulated annealing or similar I've run into uses pseudorandom numbers for this reason.
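
A toy sketch of the idea (hypothetical objective function; the fixed seed makes the whole trajectory repeatable):

    import math, random

    def anneal(seed, steps=500):
        rng = random.Random(seed)   # pseudorandom: uncorrelated steps,
        x, t = 5.0, 1.0             # yet fully reproducible from the seed
        for _ in range(steps):
            cand = x + rng.uniform(-1, 1)
            # accept improvements, or worse moves with temperature-scaled odds
            if cand * cand < x * x or rng.random() < math.exp(-1.0 / t):
                x = cand
            t *= 0.99
        return x

    assert anneal(7) == anneal(7)   # identical runs make testing easy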


Can you point at the research that says that the training process of an LLM at least the size of OLMo or Pythia is deterministic?


Can you point to something that says it's not? The only source of non-determinism I've read of affecting LLM training is floating-point error, which is well understood and worked around easily enough.


Search more: there is a lot of literature discussing how hard the problem of reproducibility of GenAI/LLMs/deep learning is, how far we are from solving it even for trivial/small models (let alone for beasts the size of the most powerful ones), and even how pointless the whole exercise is.


If there's a lot, then it should be easy for you to link an example right? One that points toward something other than floating point error.

There simply aren't that many sources of non-determinism in a modern computer.

Though I'll grant that if you've engineered your codebase for speed and not for determinism, non-determinism can creep in via floating-point rounding, sloppy ordering of operations, etc. These are not unavoidable implementation details, however; CAD kernels and other scientific software avoid them every day.

When you boil down what's actually happening during training, it's just a bunch of matrix math. And math is highly repeatable. Size of the matrix has nothing to do with it.

I have little doubt that some implementations aren't deterministic, due to software engineering choices as discussed above. But the algorithms absolutely are. Claiming otherwise seems equivalent to claiming that 2 + 2 can sometimes equal 5.


> I have little doubt that some implementations aren't deterministic

Not some of them; ALL OF THEM. Engineering training pipelines for absolute determinism would be, quite frankly, extremely dumb, so no one does it. When you need millions of dollars' worth of compute to train a non-toy model, are you going to double or triple your cost just so that the process is deterministic, without actually making the end result perform any better?


Depends on how much you value repeatability in testing, and how much compute you have. It's a choice which has been made often in the history of computer science.

The cost of adaptive precision floats can be negligible depending on application. One example I'm familiar with from geometry processing: https://www.cs.cmu.edu/~quake/robust.html

Integer math often carries no performance penalty compared to floating point.
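
Python's math.fsum is a small example of that trade-off: it uses a Shewchuk-style exact summation (the same family of techniques as the link above) at a modest constant-factor cost:

    import math

    print(sum([0.1] * 10))        # 0.9999999999999999: rounding accumulates
    print(math.fsum([0.1] * 10))  # 1.0: exactly-rounded summation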

I guess my takeaway from this conversation is that there's a market for fast high-precision math techniques in the AI field.


https://opensource.org/ai ... lots of reasoning has been done on those artifacts.


The Open Source Definition is quite clear on its requirement #2: "The source code must be the preferred form in which a programmer would modify the program. Deliberately obfuscated source code is not allowed." https://opensource.org/osd


Arguably this would still apply to DeepSeek. While they didn’t release a way of recreating the weights, it is perfectly valid and common to modify the neural network using only what was released (when doing fine-tuning or RLHF, for example, previous training data is not required). Doing modifications based on the weights certainly seems like the preferred way of modifying the model to me.
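
For instance (a hypothetical sketch: tiny model and invented file name, standing in for the real thing):

    import torch

    # load the released weights; the original training corpus never appears
    model = torch.nn.Linear(8, 2)
    model.load_state_dict(torch.load("released_weights.pt"))  # hypothetical file

    # one fine-tuning step on your own data
    opt = torch.optim.SGD(model.parameters(), lr=1e-3)
    x, y = torch.randn(4, 8), torch.randint(0, 2, (4,))
    torch.nn.functional.cross_entropy(model(x), y).backward()
    opt.step()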

Another note is that this may be the more ethical option. I’m sure the training data contained lots of copyrighted content, and if my content was in there I would prefer that it was released as opaque weights rather than published in a zip file for anyone to read for free.


It takes away the ability to know what the model does, though, which is often considered an important aspect. By not publishing details on how to train the model, there's no way to know if they have included intentional misbehavior in the training. If they provided everything needed to train your own model, you could verify this by choosing your own data and following the same methodology.

IMO it should be considered freeware, and only partially open. It's like releasing an open source program with a part of it delivered as a binary.


It's not that they want to keep the training content secret; it's the fact that they stole the training content, and who they stole it from, that they want to keep secret.


There are more, like the work by EleutherAI and LLM360.


That's what the Open Source AI Definition states: https://opensource.org/ai

In any case, DeepSeek, like Llama, fails well before hitting that new definition. Both have licenses containing restrictions on fields of use and discriminating against users. Their licenses will never be approved as Open Source.

