submitted 1 year ago by [email protected] to c/[email protected]

[-] [email protected] 88 points 1 year ago

This is kind of a dumb argument, isn't it?

I have to imagine someone centuries ago probably complained about inventors wasting their time on some dumb printing presses so smart people could write books and newspapers better when they could have been building better farm tools. But could we have developed the tractor when we did if we were still handwriting everything?

Progress supports progress. Teaching computers to recognize and reproduce pictures might seem like a waste to some people, but how do you suppose a computer will someday disassemble a ship if it is not capable of recognizing what the ship is and what holds it together? Modern AI is primitive, but it will eventually lead to autonomous machines that can actually do that work intelligently without blindly following an instruction set, oblivious to whatever might be actually happening around it.

[-] [email protected] 2 points 1 year ago* (last edited 1 year ago)

~~I get the sentiment, but it's a bad example. Transformer models don't recognize images in any useful way that could be fed to other systems.~~ They also don't have any capability of actual understanding or context. Heavily simplifying here: tokenisation of inputs lets them group clusters of letters together into tokens, so when they receive tokens they can spit out whatever the training data says they should.

~~The only actual things that are improving greatly here which could be used in different systems are natural language processing, natural language output and visual output.~~

EDIT: Crossed out stuff that is wrong.

[-] [email protected] 11 points 1 year ago

Well, this is simply incorrect. And confidently incorrect at that.

Vision transformers (ViT) are an important branch of computer vision models that apply transformers to image analysis and detection tasks. They perform very well. The main idea is the same: by tokenizing the input image into smaller chunks, you can apply the same attention mechanism as in NLP transformer models.
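To make the "tokenize the image into chunks, then apply attention" idea concrete, here is a rough toy sketch in plain Python. It is not the actual ViT implementation: the function names `patchify` and `self_attention` are made up for illustration, the image is a tiny grayscale grid, and the attention step uses identity projections with no learned weights, position embeddings, or multiple heads.

```python
import math


def patchify(image, patch_size):
    """Split a 2D grayscale image (list of rows) into flattened square patches.

    Each patch becomes one "token": a flat list of patch_size * patch_size
    pixels. Patches are emitted left to right, top to bottom.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    tokens = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [image[r][c]
                     for r in range(top, top + patch_size)
                     for c in range(left, left + patch_size)]
            tokens.append(patch)
    return tokens


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of token vectors.

    Assumption for illustration: queries, keys and values are the tokens
    themselves (identity projections), so nothing here is learned.
    """
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this token to every token, scaled by sqrt(dim).
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in tokens]
        weights = softmax(scores)
        # Each output token is a weighted average of all tokens.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(dim)])
    return out


# A 4x4 toy image split into 2x2 patches gives 4 tokens of 4 pixels each.
image = [[0, 1, 2, 3],
         [4, 5, 6, 7],
         [8, 9, 10, 11],
         [12, 13, 14, 15]]
tokens = patchify(image, 2)
mixed = self_attention(tokens)
```

With the same `patchify`, a 224x224 image cut into 16x16 patches yields 14 * 14 = 196 tokens, which is where the "16x16 words" in the paper title comes from. Once the image is a sequence of tokens, the attention step is the same kind of operation an NLP transformer applies to word tokens.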

ViT models were introduced in 2020 by Dosovitskiy et al. in the landmark paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (https://arxiv.org/abs/2010.11929), a work that has received almost 30,000 academic citations since its publication.

So claiming that transformers only improve natural language processing and visual output is straight-up wrong. They are also widely used in visual analysis, including classification and detection.

[-] [email protected] 1 points 1 year ago

Thank you for the correction. So hypothetically, with millions of hours of GoPro footage from the scuttle crew, and some futuristic supercomputer that could crunch live data from a standard-definition camera and output decisions, we could hook that up to a Boston Dynamics-style robot and have it replace one member of the crew?

[-] [email protected] 1 points 1 year ago

And such is the march of progress.

this post was submitted on 26 Feb 2024

1487 points (94.9% liked)