[HN Gopher] The killer app of Gemini Pro 1.5 is video
       ___________________________________________________________________
        
       The killer app of Gemini Pro 1.5 is video
        
       Author : simonw
       Score  : 451 points
       Date   : 2024-02-21 19:23 UTC (3 hours ago)
        
 (HTM) web link (simonwillison.net)
 (TXT) w3m dump (simonwillison.net)
        
       | minimaxir wrote:
       | Note that a video is just a sequence of images: OpenAI has a demo
       | with GPT-4-Vision that sends a list of frames to the model with a
       | similar effect:
       | https://cookbook.openai.com/examples/gpt_with_vision_for_vid...
       | 
       | If GPT-4-Vision supported function calling/structured data for
       | guaranteed JSON output, that would be nice though.
       | 
        | There are shenanigans you can do with ffmpeg to output every
        | other frame to halve the costs too. The OpenAI demo passes
        | every 50th frame of a ~600 frame video (20s at 30fps).
       | 
       | EDIT: As noted in discussions below, Gemini 1.5 appears to take 1
       | frame every second as input.
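The frame-thinning trick described above can be sketched as a small Python helper that builds ffmpeg command lines. A minimal sketch; the file names and patterns are placeholders, not taken from the OpenAI demo:

```python
# Sketch of the frame-thinning idea: build ffmpeg commands that keep
# one frame per second, or every Nth frame, before sending frames to a
# vision model. Paths and output patterns are placeholders.

def ffmpeg_sample_cmd(video_path, out_pattern, fps=1):
    """argv that resamples the video to `fps` frames per second."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"fps={fps}",  # fps=1 mirrors Gemini's apparent rate
        out_pattern,          # e.g. "frame_%04d.jpg"
    ]

def ffmpeg_every_nth_cmd(video_path, out_pattern, n=2):
    """argv that keeps every Nth frame (n=2 halves the frame count)."""
    return [
        "ffmpeg", "-i", video_path,
        "-vf", f"select=not(mod(n\\,{n}))",  # keep frames where index % n == 0
        "-vsync", "vfr",                     # drop timestamps of removed frames
        out_pattern,
    ]
```

Running either command list through subprocess would write the sampled frames to disk, ready to pass to a vision API.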
        
         | simonw wrote:
         | The number of tokens used for videos - 1,841 for my 7s video,
         | 6,049 for 22s - suggests to me that this is a much more
         | efficient way of processing content than individual frames.
         | 
         | For structured data extraction I also like not having to run
         | pseudo-OCR on hundreds of frames and then combine the results
         | myself.
        
           | og_kalu wrote:
            | No, it's individual frames
           | 
           | https://developers.googleblog.com/2024/02/gemini-15-availabl.
           | ..
           | 
           | "Gemini 1.5 Pro can also reason across up to 1 hour of video.
           | When you attach a video, Google AI Studio breaks it down into
           | thousands of frames (without audio),..."
           | 
            | And it's very likely individual frames at 1 frame/s
           | 
           | https://storage.googleapis.com/deepmind-
           | media/gemini/gemini_...
           | 
            | "Figure 5 | When prompted with a 45 minute Buster Keaton
            | movie "Sherlock Jr." (1924) (2,674 frames at 1FPS, 684k
            | tokens), Gemini 1.5 Pro retrieves and extracts textual
            | information from a specific frame and provides the
            | corresponding timestamp. At bottom right, the model
            | identifies a scene in the movie from a hand-drawn sketch."
        
             | simonw wrote:
             | Despite that being in their blog post, I'm skeptical. I
             | tried uploading a single frame of the video as an image and
             | it consumed 258 tokens. The 7s video was 1,841 tokens.
             | 
             | I think it's more complicated than just "split the video
             | into frames and process those" - otherwise I would expect
             | the token count for the video to be much higher than that.
             | 
             | UPDATE ... posted that before you edited your post to link
             | to the Gemini 1.5 report.
             | 
             | 684,000 (total tokens for the movie) / 2,674 (their frame
             | count for that movie) = 256 tokens - which is about the
             | same as my 258 tokens for a single image. So I think you're
             | right - it really does just split the video into frames and
             | process them as separate images.
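The arithmetic in the comments above can be written out as a quick sanity check; the constants are the token counts quoted in this thread:

```python
# Quick sanity check of the frames-as-images theory using the token
# counts quoted in this thread.

TOKENS_PER_IMAGE = 258  # observed cost of one uploaded image

# Gemini 1.5 report: 684k tokens for 2,674 frames of "Sherlock Jr."
movie_tokens_per_frame = 684_000 / 2_674   # comes out just under 256

# Implied frame counts for the 7s and 22s videos, both consistent
# with sampling roughly one frame per second.
frames_7s = 1_841 / TOKENS_PER_IMAGE
frames_22s = 6_049 / TOKENS_PER_IMAGE
```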
        
               | infecto wrote:
               | Edit: Was going to post similar to your update. 1841/258
               | = ~7
        
               | Arelius wrote:
               | I mean, that's just over 7 frames, or one frame/s of
                | video. There are likely fewer than that many I-frames in
               | your video.
        
               | simonw wrote:
               | Added a note about this to my post:
               | https://simonwillison.net/2024/Feb/21/gemini-pro-
               | video/#imag...
        
             | Zetobal wrote:
              | The model is fed individual frames from the movie BUT the
              | movie is segmented into scenes. These scenes are held in
              | context for 5-10 scenes, depending on their length. If the
              | video exceeds a specific length, or rather a threshold of
              | scenes, it creates an index and summary. So yes,
              | technically the model looks at individual frames, but
              | there's a bit more tooling behind it.
        
           | minimaxir wrote:
            | From the Gemini 1.0 Pro API docs (which may not be the same
            | as Gemini 1.5 in AI Studio):
           | https://cloud.google.com/vertex-ai/docs/generative-
           | ai/multim...
           | 
           | > The model processes videos as non-contiguous image frames
           | from the video. Audio isn't included. If you notice the model
           | missing some content from the video, try making the video
           | shorter so that the model captures a greater portion of the
           | video content.
           | 
           | > Only information in the first 2 minutes is processed.
           | 
           | > Each video accounts for 1,032 tokens.
           | 
            | That last point is weird because there is no way a video
            | would be a fixed number of tokens, so I suspect it is a
            | typo. The value is exactly 4x the number of tokens for an
            | image input to Gemini (258 tokens), which may be a hint
            | about the implementation.
        
         | belter wrote:
         | Prompt injection via Video?
        
           | nomel wrote:
           | Probably: https://simonwillison.net/2023/Oct/14/multi-modal-
           | prompt-inj...
        
         | ankeshanand wrote:
         | We've done extensive comparisons against GPT-4V for video
         | inputs in our technical report:
         | https://storage.googleapis.com/deepmind-
         | media/gemini/gemini_....
         | 
          | Most notably, at 1FPS the GPT-4V API errors out around 3-4
          | mins, while 1.5 Pro supports up to an hour of video inputs.
        
           | jxy wrote:
           | So that 3-4 mins at 1FPS means you are using about 500 to 700
           | tokens per image, which means you are using `detail: high`
           | with something like 1080p to feed to gpt-4-vision-preview
           | (unless you have another private endpoint).
           | 
            | Gemini 1.5 Pro uses about 258 tokens per frame (2.8M
            | tokens for 10856 frames).
           | 
           | Are those comparable?
        
           | verticalscaler wrote:
           | The average shot length in modern movies is between 4 and 16
           | seconds and around 1 minute for a scene.
        
           | moralestapia wrote:
           | >while 1.5 Pro supports upto an hour of video inputs
           | 
           | At what price, tho?
        
       | rpastuszak wrote:
       | hehe, this is great, I was _just_ (2 days ago) playing with a
       | similar problem in a web app form: browsing books in the foreign
       | literature section of a Portuguese bookstore!
       | 
       | My (less serious) ultimate goal is a universal sock pairing app:
       | never fold your socks together again, just dump them in the
       | drawer and ask the phone to find a match when you need them!
       | 
       | This seems more like a visual segmentation problem though and
       | segmentation has failed me so far.
        
         | heckelson wrote:
         | I employ a different strategy: I own 25 pairs of the same gray
         | socks (gray was chosen so that it matches most outfits) and I
         | just wear those all the time. Obviously I do own other socks
         | (for suits etc.) but it has cumulatively saved me hours of sock
         | searching.
        
           | mewpmewp2 wrote:
            | Yes, I tried to employ this same strategy, but maybe
            | because of my ADD or something, I never manage to buy the
            | same bulk socks, and eventually I run out and try to buy
            | another bulk of socks which starts to get mixed with the
            | last ones.
           | 
           | I need a robot that can physically sort and organize
           | absolutely everything in my living space.
           | 
            | I have ideas for different strategies, but I am never able
            | to actually implement them, so I end up panic searching
            | for a good pair of socks whenever there's an important
            | event or any scenario where someone would see me in socks
            | and it would be good if they looked similar enough.
        
         | ta8645 wrote:
         | I'd prefer an app that can find the missing socks for all the
         | singletons that emerge from each load of laundry. We'll
         | probably have to wait for a super AGI though.
        
       | jgalt212 wrote:
       | text prompt -> LLM -> unity -> video
       | 
       | bim, bam, boom!
        
       | sotasota wrote:
       | How does this particular use case stack up against OCR?
        
         | rmbyrro wrote:
          | I think OCR would fare pretty poorly on such messy visuals.
          | 
          | Not to mention the partially obscured titles that Gemini
          | guessed well, which would be impossible for OCR.
        
       | GaggiX wrote:
       | >GPT-4 Video and LLaVA expanded that to images.
       | 
       | A little error in the page: GPT-4V stands for vision, not video.
        
         | simonw wrote:
         | Thanks, fixed.
        
       | ilaksh wrote:
       | So you have to get invited to use Gemini Pro 1.5 right? EDIT:
       | there is a waitlist here
       | https://aistudio.google.com/app/waitlist/97445851
        
       | it_learnses wrote:
       | It's sad that Google ai studio is not available in Canada.
        
       | TheCaptain4815 wrote:
        | I wonder if the real killer app is Google's hardware scale
        | versus OpenAI's (or what Microsoft gives them). Seems like
        | nothing Google's done has been particularly surprising to
        | OpenAI's team; it's just that they have such huge scale that
        | maybe they can iterate faster.
        
         | danpalmer wrote:
         | And the fact that Google are on their own hardware platform,
         | not dependent on Nvidia for supply or hardware features.
        
         | dist-epoch wrote:
         | The real moat is that Google has access to all the video
         | content from YouTube to train the AI on, unlike anyone else.
        
           | sarreph wrote:
           | I'm not sure I would necessarily call YouTube a moat-creator
           | for Google, since the content on YouTube is for all intents
           | and purposes public data.
        
       | samstave wrote:
       | Everyone is missing the point, it seems (please BOFH me when
       | wrong);
       | 
       | Its not going to be all about "llms" and this app or that app...
       | 
       | They all will talk, just like any other ecosystem, but this one
       | is going to be different... it can ferret out connections as BGP
       | will route.
       | 
       | Gimme an AI from here, with this context, and that one and yes,
       | please Id like another...
       | 
        | and it will create soft LLMs - temporal ones dedicated to their
        | prompt - and will pull from the tendrils of knowledge it can
        | grasp and give you the result.
       | 
       | AI creates IRL Human Ephemeral Storage.
        
         | samstave wrote:
         | Pre-emptive temporal curated LLMs in ..x0x
         | 
         | Meatbag translation: The pre-emptive is the cancer that will
         | kill us.
         | 
         | Fuck you:
         | 
         | * insurance
         | 
         | * taxes
         | 
         | * health...
         | 
          | (what MAY this body-populous do, based on LLM-x trained on
          | actuarial q and reduce from Human to cellular.
         | 
         | How fucking cyberpunk dystopian would one like to get.
         | 
          | The scariest wave of intellect is those that created
          | technology before we had such technology: "well, we've
          | always been that way..."
         | 
         | Robots (AI) have no such "I would like to play in the yard"
        
       | technics256 wrote:
       | The real frustrating thing about this is how Gemini 1.5 is a
       | marketing ploy only.
       | 
        | Not even 1.0 Ultra is available in the GCP API, only for
        | their "allowlist" clients.
        
       | daft_pink wrote:
       | I feel that while youtubers and influencers are heavily
       | interested in video tools, most average users aren't that
       | interested in creating video.
       | 
       | I write a lot more email than sending out videos and the value of
       | those videos is mostly just for sharing my life with friends and
       | family, but my emails are often related to important professional
       | communications.
       | 
       | I don't think video tools will ever reach the level of usefulness
       | to everyday consumers that generative writing tools create.
        
         | simonw wrote:
         | That's why I'm excited about this particular example: indexing
         | your bookshelf by shooting a 30s video of it isn't producing
         | video for publication, it's using your phone as an absurdly
         | fast personal data entry device.
        
         | valleyer wrote:
         | Recall that TFA discusses analyzing video, not generating
         | video.
        
       | acid__ wrote:
       | Wow, only 256 tokens per frame? I guess a picture isn't worth a
       | thousand words, just ~192.
        
         | swyx wrote:
         | gpt4v is also pretty low but not as low. 480x640 frame costs
         | 425 tokens, 780x1080 is 1105 tokens
        
         | gwern wrote:
         | Back in 2020, Google was saying 16x16=256 words:
         | https://arxiv.org/abs/2010.11929#google :)
        
       | aantix wrote:
       | It'd be interesting to feed it several comedies, and see what it
       | would calculate as "laughs per minute".
       | 
       | https://www.forbes.com/sites/andrewbender/2012/09/21/top-10-...
        
       | ugh123 wrote:
       | Title should have _input_ added to the end
       | 
       | "The killer app of Gemini Pro 1.5 is video input"
       | 
       | Seems like a good way to do video moderation (YouTube) at scale,
       | if they can keep costs down...
        
         | CobrastanJorji wrote:
          | Probably overkill for content moderation, I'd think. You can
          | identify bad words looking only at audio, and you can probably
          | do nearly as good a job of identifying violence and nudity by
          | examining still images. And at YouTube scale, I imagine the
          | main problem with moderation isn't so much being correct as
          | scaling. statista.com (what's up with that site, anyway?)
          | suggests that YouTube adds something like 8 hours of video
          | per second. I didn't run the numbers, but I'm pretty sure
          | that's way too much to cost-effectively throw something like
          | Gemini Pro at.
        
           | CamelCaseName wrote:
           | For now, but in a year?
           | 
           | You could also stagger the moderation to reduce costs. E.g.
           | 
           | Text analysis: 2 views
           | 
           | Audio analysis: 300 views
           | 
           | Frame analysis: 5,000 views
           | 
           | I would be very surprised if even 20% of content uploaded to
           | YouTube passes 300 views.
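A minimal sketch of the staggered-moderation idea, assuming the hypothetical view thresholds given above:

```python
# Sketch of staggered moderation: run cheaper analyses first and
# escalate as a video accumulates views. The thresholds mirror the
# hypothetical numbers in the comment above.

TIERS = [
    (2, "text"),        # transcript/metadata analysis
    (300, "audio"),     # audio-track analysis
    (5_000, "frames"),  # per-frame visual analysis
]

def analyses_due(view_count):
    """All analyses a video qualifies for at this view count."""
    return [name for threshold, name in TIERS if view_count >= threshold]
```

Under this scheme most uploads never trigger the expensive per-frame pass, which is the point of the cost argument.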
        
             | elzbardico wrote:
             | People assume that we can scale the capabilities of LLMs
             | indefinitely, I on the other side strongly suspect we are
             | probably getting close to diminishing returns territory.
             | 
             | There's only so much you can do by guessing the next
             | probably token in a stream. We will probably need something
             | else to achieve what people think that will soon be done
             | with LLMs.
             | 
             | Like Elon Musk probably realizing that computer vision is
             | not enough for full self-driving, I expect we will soon
             | reach the limits of what can be done with LLMs.
        
             | ugh123 wrote:
              | Or... Google supplies some kind of local LLM tool which
              | processes your videos before they're uploaded. You pay
              | for the gpu/electricity costs. Obviously this would need
              | to be done in a way that can't be hacked/manipulated.
              | Might need to be highly integrated with a backend
              | service that manages the analyzed frames from the local
              | machine and verifies hashes/tokens after the video is
              | fully uploaded to YouTube.
        
             | halamadrid wrote:
             | It should be far less than 20%.
             | 
             | I guess it could also be associated with views per time
             | period to optimize better. If the video is interesting,
             | people will share and more views will happen quickly.
        
           | mrinterweb wrote:
           | I have no idea how YouTube currently moderates its content,
           | but there may be some benefit with Gemini. I'm sure Googlers
           | have been considering this option.
        
         | yieldcrv wrote:
         | yeah I need a live updated chart that tells me what kind of
         | multimodal input and output a model or service can do
         | 
         | its super confusing now because each i/o method is novel and
         | exciting _to that team_ and their users may not know what else
         | is out there
         | 
         | but for the rest of us looking for competing services its
         | confusing
        
       | jpeter wrote:
       | Next step is to use all of YouTube to train Gemini 2.0.
        
         | karmasimida wrote:
         | As long as it doesn't regenerate (I don't think google will
         | allow it), for video analysis, it is totally within google's
         | rights to do it.
        
       | whiterknight wrote:
       | Safety is becoming an orwellian word to refer to things that
       | can't actually harm you.
        
       | JieJie wrote:
       | Sure, but does it pass the Selective Attention Test?
       | 
       | https://www.youtube.com/watch?v=vJG698U2Mvo
       | 
       | (I don't know, I don't have access.)
        
       | smartmic wrote:
       | That is impressive at first glance, no question. To stay with the
       | example of the bookshelf, you would only follow this path for
       | several or very many books, as in the example with the cookbooks.
       | I have no idea how good the Geminis or GPTs of this world
       | currently are, but let's optimistically assume a 3% error rate
       | due to hallucinations or something. If I want to be sure that the
       | results are correct, then I have to go through and check each
        | entry manually. I want to rule out the possibility that the 3%
        | include titles that would completely turn an outsider's view
        | of me upside down.
       | 
       | So, even if data entry is incredibly fast, curation is still
       | time-consuming. On balance, would it even be faster to capture
       | the ISBN code of 100 books with a scanner app, assuming that the
       | index lookup is correct, or to compare 100 JSON objects with
       | title and author for correctness?
       | 
       | The example is only partly serious. I just think that as long as
       | hallucinations occur, Generative AI will only get part of my
       | trust - and I don't know about you, but if I knew that a person
       | was outright lying to me in 3% of all his statements, I wouldn't
       | necessarily seek his proximity in things that are important to
       | me...
        
         | jpc0 wrote:
         | This right here.
         | 
         | I'm currently building out some code that should go in
         | production in the next week or two and simply because of this
         | we are using LLM to prefill data and then have a human look
         | over it.
         | 
          | For our use case the LLM prefilling the data is significantly
          | faster, but if it ever gets to the point of that not needing
          | to happen, it would take a task which takes about 3 hours
          | (now down to one hour) and make it a task that takes 3
          | minutes.
         | 
          | Will LLMs ever get to the point where they are perfectly
          | reliable (or at least have an error margin low enough for
          | our use case)? I don't think so.
         | 
         | It does make for a very cheap accelerator though.
        
         | simonw wrote:
         | This isn't a problem that's unique to LLMs though.
         | 
         | Pay a bunch of people to go through and index your book
         | collection and you'll get some errors too.
         | 
         | What's interesting about LLMs is they take tasks that were
         | previously impossible - I'm not going to index my book
         | collection, I do not have the time or willpower to do that -
         | and turned them into things that I can get done to a high but
         | not perfect standard of accuracy.
         | 
         | I'll take a searchable index of my books that's 95% accurate
         | over no searchable index at all.
        
       | darkwater wrote:
        | I find it really hard to understand how a system like this can
        | STILL be fooled by the Scunthorpe problem (this time with
        | "cocktail"). Aren't LLMs supposed to be good at context?
        
       | keefle wrote:
       | How would the results compare to:
       | 
        | 1. Video frames are sampled (based on frame clarity)
        | 
        | 2. The images are fed to OCR, with their content output as:
        | 
        | Frame X: <content of the frame>
        | 
        | 3. The accumulated text is given to an average LLM (Mistral)
        | and asked the same request mentioned by the author (creating a
        | JSON file containing book information)
        | 
        | Wouldn't we get something similar, maybe if a more
        | sophisticated AI is used? So the monopoly of Gemini Pro on
        | video processing (specifically when it comes to handling text
        | present inside the video) is not really a sustainable
        | advantage? Or am I missing something (is this something beyond
        | just a fancy OCR hooked into an LLM, as the model would be
        | able to tell that this text is on a book, for instance)?
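A rough sketch of steps 2-3 of that proposal, with the OCR engine left as a pluggable callable since none is specified; the `build_prompt` helper and its shape are illustrative, not a real API:

```python
# Sketch of the proposed pipeline: sampled frames are OCR'd and the
# per-frame text is packed into one prompt for an ordinary LLM. `ocr`
# is any callable returning text for a frame (e.g. a pytesseract
# wrapper); nothing here depends on a specific engine.

def build_prompt(frames, ocr, request):
    """Render 'Frame X: <content>' lines followed by the task request."""
    lines = [f"Frame {i}: {ocr(frame)}" for i, frame in enumerate(frames, 1)]
    return "\n".join(lines) + "\n\n" + request
```

With a real OCR engine plugged in, the resulting string would be sent to the LLM along with the JSON-extraction request.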
        
         | simonw wrote:
          | Sure, you can slice a video up into images and process them
          | separately - that's apparently how Gemini Pro works: it uses
          | one frame from every second of video.
         | 
          | But you still need a REALLY long context length to work with
          | that information - the magic combination here is 1,000,000
          | tokens combined with good multi-modal image inputs.
        
           | keefle wrote:
           | I see, but I was wondering about the partial transferability
           | of this feature to other LLMs
           | 
           | But fair enough, context length is key in this scenario
        
       | phaser wrote:
       | This is nice. but since google is probably training on it's vast
       | google books data set, i'm not extremely surprised.
        
       | barrkel wrote:
       | I wonder if it could identify new books with titles it's never
       | seen before.
        
       | blueblimp wrote:
       | The tech is legitimately impressive and exciting, but I couldn't
       | help but chuckle at the revenge of the Scunthorpe problem:
       | 
       | > It looks like the safety filter may have taken offense to the
       | word "Cocktail"!
        
         | 1oooqooq wrote:
         | > > It looks like the safety filter may have taken offense to
         | the word "Cocktail"!
         | 
         | It's almost as if they got some intern to "code" the
         | correctness filter using some AI coding assistant!
        
       | IshKebab wrote:
       | What do the tokens for an image even look like? I understand that
       | tokens for text are just fragments of text... but that obviously
       | doesn't make sense for images.
        
         | bonoboTP wrote:
         | The image is subdivided by a grid and the resulting patches are
         | fed through a linear encoder to get the token embeddings.
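A toy version of that grid-patch scheme, just counting the resulting tokens; the 16px patch size is the ViT convention, an assumption here rather than a disclosed Gemini detail:

```python
# Toy illustration of grid-patch tokenization: the image is cut into
# patch x patch squares and each square becomes one token embedding
# via a linear projection. Patch size 16 is the ViT convention.

def num_patch_tokens(height, width, patch=16):
    """Token count for an image whose sides divide evenly by `patch`."""
    return (height // patch) * (width // patch)
```

At 256x256 pixels this gives 16*16 = 256 tokens, close to the 258-token figure observed for Gemini image inputs elsewhere in the thread.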
        
       | luke-stanley wrote:
       | I can't access that Google AI Studio link because I'm in some
       | strange place called the UK so I'm unable to verify or prototype
       | with it currently. People at Deepmind, what's with that?
        
       | tekni5 wrote:
        | I was thinking about this a while back: once AI is able to
        | analyze video, images and text, and do so cheaply and
        | efficiently, it's game over for privacy, like completely.
        | Right now massive
       | corps have tons of data on us, but they can't really piece it
       | together and understand everything. With powerful AI every aspect
       | of your digital life can be understood. The potential here is
       | insane, it can be used for so many different things good and bad.
       | But I bet it will be used to sell more targeted goods and
       | services.
        
         | worldsayshi wrote:
         | Unless you live in the EU and have laws that should protect you
         | from that.
        
           | seniorivn wrote:
           | incentives cannot be fixed with just prohibitive laws, war on
           | drags should've taught you something
        
             | garbagewoman wrote:
             | War on drags? I thought that was just in Florida
        
               | ineedaj0b wrote:
               | please consider commenting more thoughtfully. I
               | understand this is a joke but we don't want this site to
               | devolve into Reddit.
        
               | Pengtuzi wrote:
               | Don't feed egregious comments by replying; flag them
               | instead. If you flag, please don't also comment that you
               | did.
               | 
               | Please don't complain about tangential annoyances--e.g.
               | article or website formats, name collisions, or back-
               | button breakage. They're too common to be interesting.
               | 
               | Please don't post comments saying that HN is turning into
               | Reddit. It's a semi-noob illusion, as old as the hills.
        
               | ineedaj0b wrote:
                | I don't have a flag button or option; my account is
                | new.
        
             | SV_BubbleTime wrote:
             | Drugs... Oooohh. I get it now.
        
             | jpk wrote:
             | Laws, and more specifically their penalties, are precisely
             | for fixing incentives. It's just a matter of setting a
             | penalty that outweighs the natural incentive you want to
             | override. e.g., Is it more expensive to respect privacy, or
             | pay the fine for not doing so? PII could, and should, be
             | made radioactive by privacy regulations and their
             | associated penalties.
        
           | YetAnotherNick wrote:
            | Is it true, or more of a myth? From what I've read online,
            | Europe has the "think of the children" narrative at least
            | as commonly as other parts of the world. They have tried
            | hard to ban encryption in apps many times.[1]
           | 
           | [1]: https://proton.me/blog/eu-council-encryption-vote-
           | delayed
        
             | smoldesu wrote:
             | > They tried hard to ban encryption in apps many times.
             | 
             | That's true of most places. We should applaud the EU's
             | human rights court for leading the way by banning this
             | behavior: https://www.eureporter.co/world/human-rights-
             | category/europe...
        
             | devjab wrote:
             | Democratic governance is complicated. It's never black and
             | white and it's perfectly possible for parts of the EU to be
             | working to end encryption while another part works toward
             | enhancing citizen privacy rights. Often they're not even
              | supported by the same politicians, but since it's not a
              | winner-takes-all sort of thing, it can all happen
             | simultaneously and sometimes they can even come up with
             | some "interesting" proposals that directly interfere with
             | each other.
             | 
             | That being said there is a difference between the US and
             | the EU in regards to how these things are approached. Where
             | the US is more likely to let private companies destroy
             | privacy while keeping public agencies leashed it's the
             | opposite in Europe. Truth be told, it's not like the US
              | initiatives are really working, since agencies like the
              | NSA seem to blatantly ignore all laws anyway, which has
              | caused some scandals here in Europe as well. In Denmark
              | our Secret Police isn't allowed to spy on us without
              | warrants, but our changing governments have had different
              | secret agreements with the US to let the US monitor our
              | internet traffic. The scandal isn't so much that; it's
              | that our Secret Police is allowed to get information
              | about Danish citizens from the NSA without warrants,
              | spying on us by getting data from the NSA that they
              | aren't allowed to gather themselves.
             | 
             | Anyway, it's a complicated mess, and you have so many
             | branches of the bureaucracy and so many NGOs pulling in
             | different directions that you can't say that the EU is pro
             | or anti privacy the way you want to. Because it's both of
             | those things and many more at the same time.
             | 
             | I think the only thing the EU unanimously agrees on (sort
             | of) is to limit private companies access to citizen privacy
             | data. Especially non-EU organisations. Which is very hard
             | to enforce because most of the used platforms and even
             | software isn't European.
        
               | YetAnotherNick wrote:
               | I am fine with private company using my data for showing
               | me better ads. They can't affect my life significantly.
               | 
                | I am not fine with the government using the data to
                | police me. Already in most countries, governments are
                | putting people in jail over things like hate speech,
                | where the laws are really vague.
        
           | tekni5 wrote:
            | What happens if it's a datamining third-party bot that can
            | check your social media accounts and create an in-depth
            | profile on you? Every image, video, and post you've made
            | has been recorded and understood. It knows everything
            | about you, every
           | product you use, where you have been, what you like, what you
           | hate, everything packaged and ready to be sold to an
           | advertiser, or the government, etc.
        
           | spacebanana7 wrote:
           | Public sector agencies and law enforcement are generally
           | exempt (or have special carve outs) in European privacy
           | regulations.
        
         | londons_explore wrote:
         | > I bet it will be used to sell more targeted goods and
         | services.
         | 
         | Plenty of companies have been shoving all the unstructured data
         | they have about you and your friends into a big neural net to
         | predict which ad you're most likely to click for a decade
         | now...
        
           | tekni5 wrote:
           | Sure but not images and video. Now they can look at a picture
           | of your room and label everything you own, etc.
        
              | londons_explore wrote:
              | Yes, including images and video. It's been basically
              | standard practice to take each piece of user data and turn
              | it into an embedding vector, then combine all the vectors
              | with some time/relevancy weighting or neural net, then use
              | the resulting vector to predict user click-through rates
              | for ads (which effectively determines which ad the user
              | will see).
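              | A minimal sketch of that pipeline in code (the
              | embeddings, decay constants and names are illustrative,
              | not any particular company's system):

```python
import math

# Illustrative sketch of the ad-ranking recipe described above:
# each user event is an embedding vector; combine them with a
# recency weighting, then score candidate ads by dot product.

def user_vector(events, half_life=7.0):
    """events: list of (age_in_days, embedding). Recency-weighted mean."""
    dim = len(events[0][1])
    acc = [0.0] * dim
    total = 0.0
    for age, emb in events:
        w = 0.5 ** (age / half_life)  # exponential time decay
        total += w
        for i, x in enumerate(emb):
            acc[i] += w * x
    return [x / total for x in acc]

def predicted_ctr(user_vec, ad_vec):
    """Squash the dot product into a (0, 1) click probability."""
    dot = sum(u * a for u, a in zip(user_vec, ad_vec))
    return 1.0 / (1.0 + math.exp(-dot))

# A recent "shoes" interest outweighs a month-old "garden" one.
events = [(0.5, [1.0, 0.0]), (30.0, [0.0, 1.0])]
u = user_vector(events)
ads = {"shoes": [1.0, 0.0], "garden": [0.0, 1.0]}
best = max(ads, key=lambda name: predicted_ctr(u, ads[name]))
```

              | The point is just the shape: one user vector, one dot
              | product per candidate ad.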
        
         | ryukoposting wrote:
          | You hit the nail on the head. People dismissing this because it
         | isn't perfectly accurate are missing the point. For the
         | purposes of analytics and surveillance, it doesn't _need_ to be
         | perfectly accurate as long as you have enough raw data to
         | filter out the noise. The Four have already mastered the
         | "collecting data" part, and nobody in North America with the
         | power to rein in that situation seems interested in doing so
         | (this isn't to say the GDPR is perfect, but at least Europe is
          | _trying_).
         | 
         | It's depressing that the most extraordinary technologies of our
         | age are used almost exclusively to make you buy shit.
        
       | chefandy wrote:
       | These things seem great for casual use, but not trustworthy
       | enough for archival work, for example. The world needs casual-use
       | tools, too, but there are bigger impact use cases in the
       | pipeline. I'd love for these things to communicate when they're
       | shaky on an interpretation, for example. Maybe pairing it with a
       | different model and using an adversarial approach? Getting a
       | confidence rating on existing messy data where the source is
       | available for a second pass could be a good use case.
       | 
       | Looking at this, however, my hope is soured by the exponentially
       | growing power of our law enforcement's panopticon. The existing
       | shitty, buggy facial recognition system is already bad, but
       | making automated fingerprints of people's movements based on
       | their face combined with text on clothing and bags, the logos on
       | your shoes, protest signs, alerting authorities if people have
       | certain bumper stickers or books, recording the data on every
       | card made visible when people open their wallets at public
       | transit hubs or to pay for coffee or groceries, or set up a cheap
       | remote camera across the street from a library to make a big list
       | of every book checked out correlated with facial recognition... I
        | mean, _damn._ Even in the private sector, this affords retailers
        | the ability to build mass databases of any logo you've had on
        | you when walking into their stores... or rather any store,
        | considering it will be data brokers who keep the data. Given how
        | much privacy
       | our society has killed with the data we have, I'm genuinely
       | concerned about what they will make next. Attempts to limit
       | Facebook, et al may well seem quaint pretty soon. How about
       | criminal applications? You can get a zoom camera with incredible
       | range for short money, and surely it wouldn't be that hard to
       | find a counter in front of a window where people show sensitive
       | documents. Even just putting a phone with the camera facing out
       | in your shirt pocket and walking around a target rich environment
       | could be useful when you can comb through that gathered data
       | looking for patterns, too.
       | 
       | That said, I'm not in security, law enforcement, crime, or
       | marketing data collection so maybe I'm full of beans and just
       | being neurotic.
       | 
        |  _Edit: if you're going to downvote me, surely you're capable of
        | articulating your opposition in a comment, no?_
        
         | nox101 wrote:
          | Honest question: why is it bad? I see that posted over and
          | over. Right now I watch SF and LA feel like third-world
          | countries. Nothing appears to be enforced: traffic laws, car
          | break-ins, car theft, garage break-ins, house break-ins.
         | 
          | I'd personally choose a little less privacy if it meant fewer
          | people were getting injured by drivers ignoring the traffic
          | laws, and fewer people were having to shell out for all the
          | costs associated with theft, including replacing or repairing
          | the damaged/stolen item as well as the increased insurance
          | costs, costs that get added to everyone's insurance regardless
          | of income level. Note: a car or garage break-in has both the
          | cost of the items stolen and the cost to repair the
          | car/garage/house.
         | 
         | I don't know where to draw the line. I certainly don't want
         | cameras in my house or looking through my windows. Nor do I
         | want it on my computer or TV looking at what I do/view.
         | 
         | For traffic, I kind of feel like at a minimum, if they can move
         | the detection to the cameras and only save/transmit the
         | violations that would be okay with me. You violated the law in
         | a public space that affected others, your right to not be
         | observed ends for that moment in time. Also, if I could
         | personally send in violations I would have sent 100s by now. I
         | see 3-8 violations every time I go out for a 30-60 minute
         | drive.
         | 
         | https://www.latimes.com/california/story/2024-01-25/traffic-...
         | 
         | There are similar articles for SF.
        
           | simonw wrote:
           | It's bad because while you may trust the government right
           | now, there are no guarantees that a government you do NOT
           | trust won't be elected in the future.
           | 
            | Also important to consider that government institutions are
            | made up of individuals. Do you want a police officer who is
            | the abuser in a bad domestic situation being given the power
            | to track their partner using the resources made available to
            | them in their work?
        
             | hackerlight wrote:
             | > It's bad because while you may trust the government right
             | now, there are no guarantees that a government you do NOT
             | trust won't be elected in the future.
             | 
             | Yes, but this ignores the reverse causality component.
             | 
             | If people feel unsafe then the probability that a bad
             | government gets elected goes up. Look at El Salvador.
             | Freedom can't survive if people's basic needs (such as
             | physical safety) aren't met.
             | 
             | The freedom vs safety dichotomy isn't a simple spectrum.
             | There are feedback dynamics.
        
           | chefandy wrote:
           | Sadly, you should disabuse yourself of the notion that our
           | government will only use these powers in our best interest by
           | looking at COINTELPRO, manufactured evidence for invading
            | Iraq, mass incarceration for nonviolent crimes,
            | surveilling and prosecuting rape victims who live in the
            | wrong jurisdictions for seeking abortions, police treatment
            | of people who speak out against them (they'll have access,
            | too), the Red Scare, etc. etc. etc. And that's entirely
           | ignoring what we may be subject to by other governments. Even
           | the increasing polarity between partisan political entities
           | is concerning. If our country is run by someone comfortable
           | with encouraging their supporters to violently put down
           | opposition, do you want them supported by agencies that have
           | access to this stuff? If you are, should everybody else have
           | to be?
           | 
            | One way I gauge where we are is to compare it to what people
            | previously considered problematic. We've witnessed a tectonic
            | shift in the Overton window for reasonable surveillance--
            | each incremental change is presented as a reasonable, prudent
            | step that a preponderance of people agree is beneficial.
            | However, if you compiled the changes that have taken place
            | and presented them to someone from 1984, for example, they'd
            | be understandably shocked.
           | 
            | For people who have the correct ideas about what to believe,
            | what to say, what to do, and how to do it, according to
            | everyone from their municipal jurisdictions to the federal
            | government and all of its arms, it's probably not a problem.
            | Can we accept the government installing machinery to squash
            | everybody else?
           | 
           | Speeding and red light camera tickets are one thing-- they
           | selectively capture stills of people who have likely
            | committed a crime. Camera networks that track all cars'
            | movements by recording license plate sightings are more
           | representative of what the future looks like. Think I'm being
           | paranoid? It's already implemented:
           | https://turnto10.com/news/local/providence-police-
           | department...
           | 
            |  _Edit: again, if you're going to downvote me, surely you're
            | capable of articulating your opposition in a comment, no?_
        
       | hendry wrote:
        | Can it work on traffic, I wonder? Automatic number-plate
        | recognition (ANPR), for example.
        
       | plastic3169 wrote:
        | I was just today thinking that AI-assisted editing could be a
        | nice interface. You could watch the image and work mostly by
        | speaking. The computer could pull up images based on a
        | description, make a first assembly edit, and offer alternatives.
        | Ok, drop that shot; cut from this shot when the character's eyes
        | leave the frame; replace this take; etc. There is something
        | about editing that feels contained enough that it can be
        | described with language.
        
       | 7357 wrote:
        | To me, the 'It didn't get all of them' is what makes me think
        | this AI thing is just a toy. Don't get me wrong, it's marvelous
        | as it is, but it's only useful (I use ollama + mistral 7B) when
        | I know nothing; if I do have some understanding of the topic at
        | hand, it just becomes plain wrong. Hopefully I will be corrected.
        
         | simonw wrote:
         | Have you spent much time with GPT-4?
         | 
         | I like experimenting with Mistral 7B and Mixtral, but the
         | quality of output from those is still sadly in a different
         | league from GPT-4.
        
       | MyFirstSass wrote:
       | Ok, crazy tangent;
       | 
        | Where agents will potentially become _extremely_
        | useful/dystopian is when they just silently watch your entire
        | screen at all times. Isolated, encrypted and local, preferably.
       | 
       | Imagine it just watching you coding for months, planning stuff,
       | researching things, it could potentially give you personal and
       | professional advice from deep knowledge about you. "I noticed you
       | code this way, may i recommend this pattern" or "i noticed you
       | have signs of this diagnosis from the way you move your mouse and
       | consume content, may i recommend this lifestyle change".
       | 
       | I wonder how long before something like that is feasible, ie a
       | model you install that is constantly updated, but also constantly
       | merged with world data so it becomes more intelligent on two
       | fronts, and can follow as hardware and software advances over the
       | years.
       | 
        | Such a model would be dangerously valuable to corporations /
        | bad actors as it would mirror your psyche and remember so much
        | about you - so it would have to be running with a degree of
        | safety i can't even imagine, or you'd be cloneable or lose all
        | privacy.
        
         | az226 wrote:
         | Rewind.ai
        
           | evaneykelen wrote:
           | I have tried Rewind and found it very disappointing.
           | Transcripts were of very poor quality and the screen capture
           | timeline proved useless to me.
        
             | Falimonda wrote:
             | If it wasn't for the poor transcript quality would you
             | consider Rewind.ai to be valuable enough to use day-to-day?
             | 
             | Could you elaborate on what was useless about the screen
             | capture timeline?
        
         | CamperBob2 wrote:
         | I liked this idea better in THX-1138.
        
           | MyFirstSass wrote:
           | One of the movies i've had on my watch list for far too long,
           | thanks for reminding me.
           | 
           | But yeah, dystopia is right down the same road we're all
           | going right now.
        
             | mdanger007 wrote:
             | Reading The Four by Scott Galloway, Apple, Facebook,
             | Google, and Amazon were dominating the market 7 years ago
             | generating 2.3 trillion in wealth. They're worth double
             | that now.
             | 
             | The Four, especially with its AI, is going to control the
             | market in ways that will have a deep impact on government
             | and society.
        
               | MyFirstSass wrote:
               | Yeah, that's one of the developments i'm unable to spin
               | positively.
               | 
                | As technological society advances, the threshold to
                | enter the market with anything not completely laughable
                | becomes exponentially higher, only further consolidating
                | old money and the already established, right?
               | 
               | What i found so amazing about the early internet, or even
               | just the internet 2.0 was the possibility to create a
               | platform/marketplace/magazine or whatever, and actually
               | have it take off and get a little of the shared growth.
               | 
               | But now it seems all growth has become centralised to a
               | few apps and marketplaces and the barrier to entry is
               | getting harder by the hour.
               | 
               | Ie. being an entrepreneur is harder now because of tech
               | and market consolidation. But potentially mirrored in
               | previous eras like the industrialisation - i'm just not
               | sure we'll get another "reset" like that to allow new
               | players.
               | 
                | Please someone tell me this is wrong and there's still
                | hope for the tech entrepreneurs / sideprojects!
        
         | system2 wrote:
          | If a 7-second video consumes almost 2k tokens, I'd assume the
          | budget to process such a prompt must be insane.
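          | Back-of-the-envelope, using the token counts from the
          | article (1,841 tokens for 7s, 6,049 for 22s); the hourly
          | extrapolation is my own:

```python
# Back-of-the-envelope token budget for video prompts, using the
# counts from the article: 1,841 tokens for 7s, 6,049 for 22s.
tokens_7s, tokens_22s = 1841, 6049

rate_7s = tokens_7s / 7     # ~263 tokens per second of video
rate_22s = tokens_22s / 22  # ~275 tokens per second of video

# Extrapolate to an hour of video at ~270 tokens/s.
tokens_per_hour = 270 * 3600  # ~972k tokens
```

          | ~972k tokens for an hour of video lines up with the claim
          | that Gemini 1.5 Pro can reason across up to 1 hour of video
          | in a 1M-token context.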
        
           | yazaddaruvala wrote:
            | Unlikely to be a prompt. It would need to be some form of
            | fine-tuning, like LoRA.
        
           | MyFirstSass wrote:
            | Yeah, not feasible with today's methods and RAG/LoRA
            | shenanigans, but the way the field is moving i wouldn't be
            | surprised if new decoder paradigms made it possible.
           | 
            | Saw this yesterday, 1M context window, but haven't had any
            | time to look into it; just an example of the new
            | developments happening every week:
           | 
           | https://www.reddit.com/r/LocalLLaMA/comments/1as36v9/anyone_.
           | ..
        
           | Invictus0 wrote:
            | That's a 7-second video from an HD camera. When recording a
            | screen, you only really need to consider what's changing on
            | the screen.
        
             | nostrebored wrote:
              | That's not true. Which content on the screen is important
              | context might change depending on the new changes.
        
         | zoogeny wrote:
         | Why watch your screen when you could feed in video from a
         | wearable pair of glasses like those Instagram Ray Bans. And why
         | stop at video when you could have it record and learn from a
         | mic that is always on. And you might as well throw in a feed of
         | your GPS location and biometrics from your smart watch.
         | 
         | When you consider it, we aren't very far away from that at all.
        
         | Animats wrote:
         | > Imagine it just watching you coding for months, planning
         | stuff, researching things, it could potentially give you
         | personal and professional advice from deep knowledge about you.
         | 
         | And then announcing "I can do your job now. You're fired."
        
           | ghxst wrote:
           | That's why we would want it to run locally! Think about a
           | fully personalized model that can work out some simple tasks
           | / code while you're going out for groceries, or potentially
           | more complex tasks while you're sleeping.
        
             | underdeserver wrote:
              | It's local to _your employer's_ computer.
        
               | albumen wrote:
               | Have it running on your personal comp, monitoring a
               | screen-share from your work comp. (But that would
               | probably breach your employment contract re saving work
               | on personal machines.)
        
               | ssl-3 wrote:
               | It can be.
               | 
               | It can also be local to my own computer. People do write
               | software while they're away from work.
        
               | mostlysimilar wrote:
               | Corporations would absolutely force this until it could
               | do your job and then fire you the second they could.
        
           | ChrisClark wrote:
           | That sounds a lot like Learning To Be Me, by Greg Egan. Just
           | not quite as advanced, or inside your head.
        
         | frizlab wrote:
         | I would hate that so much.
        
           | FirmwareBurner wrote:
            | IKR, who wouldn't want another Clippy constantly nagging
            | you, but this time with a higher IQ and more intimate
            | knowledge of you? /s
        
         | chamomeal wrote:
          | Not crazy! I listened to a Software Engineering Daily episode
          | about pieces.app. Right now it's some dev productivity tool
          | or something, but in the interview the guy laid out a crazy
          | vision that sounds like what you're talking about.
         | 
          | He was talking about eventually having an agent that watches
          | your screen and remembers what you do across all apps, and
          | can store it and share it with your team.
         | 
         | So you could say "how does my teammate run staging builds?" or
         | "what happened to the documentation on feature x that we never
         | finished building", and it'll just _know_.
         | 
          | Obviously that's far away, and it was just the ramblings of
          | an excited founder, but it's fun to think about. Not sure if
          | I hate it or love it lol
        
           | jerbear4328 wrote:
            | Being able to ask about stuff other people do seems like it
            | could be rife with privacy issues, honestly. Even if the
           | model was limited to only recording work stuff, I don't think
           | I would want that. Imagine "how often does my coworker browse
           | to HN during work" or "list examples of dumb mistakes my
           | coworkers have made" for some not-so-bad examples.
        
         | slg wrote:
         | >Isolated, encrypted and local of course.
         | 
         | And what is the likelihood of that "of course" portion actually
         | happening? What is the business model that makes that route
         | more profitable compared to the current model all the leaders
         | in this tech are using in which they control everything?
        
         | foolfoolz wrote:
         | you could design a similar product to do the opposite and
         | anonymize your work automatically
        
         | mixmastamyk wrote:
         | _" It looks like you're writing a suicide note... care for any
         | help?"_
         | 
         | https://www.reddit.com/r/memes/comments/bb1jq9/clippy_is_qui...
        
         | pier25 wrote:
         | > _encrypted and local of course_
         | 
         | Only for people who'd pay for that.
         | 
         | Free users would become the product.
        
           | fillskills wrote:
           | Unless its open sourced :)
        
             | troupo wrote:
              | In the modern world, open code often doesn't mean much.
              | E.g. Chrome is open-sourced, and yet no one really
              | contributes to it or has any say over the direction it's
              | going:
              | https://twitter.com/RickByers/status/1715568535731155100
        
               | stavros wrote:
               | Open source isn't meant to give everyone control over a
               | specific project. It's meant to make it so, if you don't
               | like the project, you can fork it and chart your own
               | direction for it.
        
               | DariusKocar wrote:
                | One needs to follow the money to find the true
                | direction. I think the ideal setup is that such a
                | product is owned by a public figure/org who has no
                | vested interest in making money off it or misusing it.
        
               | pier25 wrote:
               | Chrome is not open sourced, Chromium is.
        
         | philips wrote:
         | I have a friend building something like that at
         | https://perfectmemory.ai
        
         | DariusKocar wrote:
         | I'm working on this! https://www.perfectmemory.ai/
         | 
          | It's encrypted (on top of BitLocker) and local. There's all
          | this competition over who makes the best, most articulate
          | LLM, but the truth is that off-the-shelf 7B models can put
          | sentences together with no problem. It's the context they're
          | missing.
        
           | crooked-v wrote:
            | I feel like storage requirements are really going to be the
            | issue for these apps/services that run on "take screenshots
            | and OCR them" functionality with LLMs. If you're using
            | something like this, a huge part of the value proposition
            | is in the long term, but until something has a more
            | efficient way to function, even a one-year history is
            | impractical for a lot of people.
           | 
            | For example, consider the classic situation of accidentally
            | giving someone the same Christmas gift that you did a few
            | years back. A sufficiently powerful personal LLM that
            | 'remembers everything' could absolutely help with that
            | (maybe even give you a nice table of the gifts you've
            | purchased online, who they were for, and what categories of
            | items would complement a previous gift), but only if it can
            | practically store that memory for a multi-year time period.
        
             | DariusKocar wrote:
              | It's not that bad. With Perfect Memory AI I see ~9GB a
              | month. That's ~108 GB/year, and HDDs/SSDs grow by more
              | than that every year. The storage also varies with what
              | you do, your workflow, and your display resolution.
              | Here's an article I wrote on my findings on storage
              | requirements:
              | https://www.perfectmemory.ai/support/storage-
              | resources/stora...
              | 
              | And if you want to use the data for the LLM only, then
              | you don't need to store the screenshots at all. Then it's
              | ~15MB a month.
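              | The arithmetic above, as a quick sanity check (the 9 GB
              | and 15 MB figures are from this comment; the rest just
              | multiplies them out):

```python
# Sanity-check the storage figures quoted above.
gb_per_month = 9                   # observed capture rate, screenshots kept
gb_per_year = gb_per_month * 12    # 108 GB/year with screenshots

mb_per_month_text_only = 15        # OCR text only, screenshots discarded
mb_per_year_text_only = mb_per_month_text_only * 12  # 180 MB/year
```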
        
           | smusamashah wrote:
            | Your website and blog are very low on details about how
            | this works. Downloading and installing it directly feels
            | unsafe imo, especially when I don't know how the software
            | works. Is it recording video, performing OCR continuously,
            | or taking just screenshots?
            | 
            | There's no mention of using any LLMs in there at all, which
            | is how you are presenting it in your comment here.
        
           | milesskorpen wrote:
           | Basically looks like rewind.ai but for the PC?
        
             | cyrux004 wrote:
              | Exactly, the UI is shockingly similar.
        
         | behat wrote:
         | Heh. Built a macOS app that does something like this a while
         | ago - https://github.com/bharathpbhat/EssentialApp
         | 
         | Back then, I used on device OCR and then sent the text to gpt.
         | I've been wanting to re-do this with local LLMs
        
         | chancemehmu wrote:
         | That's impel - https://tryimpel.com
        
           | dweekly wrote:
           | There's limited information on the site - are you using them
           | or affiliated with them? What's your take? Does it work well?
        
             | chancemehmu wrote:
              | I have been using their beta for the past two weeks and
              | it's pretty good. Like, I'll be watching YouTube videos
              | and it just pops up automatically.
             | 
             | I don't know if it's public yet, but they sent me this
             | video with the invite: https://youtu.be/dXvhGwj4yGo
        
           | crooked-v wrote:
           | The "smart tasks" part of it looks like the most compelling
           | part of that to me, but it would have to be REALLY reliable
           | for me to use it. 50% reliability in capturing tasks is about
           | the same as 0% reliability when it comes to actually being a
           | useful part of anything professional.
        
         | oconnor663 wrote:
         | A version of this that seems both easier and less weird would
         | be an AI that listens to you all the time when you're learning
         | a foreign language. Imagine how much faster you could learn,
         | and how much more native you could ultimately get, if you had
         | something that could buzz your watch whenever you said
         | something wrong. And of course you'd calibrate it to understand
         | what level you're at and not spam you constantly. I would love
         | to have something like that, assuming it was voluntary...
        
           | lawlessone wrote:
           | >assuming it was voluntary...
           | 
            | Imagine if it was wrong about something, but every time you
            | tried to submit the bug report it disabled your arms via
            | Neuralink.
        
           | lucubratory wrote:
           | I think even aside from the more outlandish ideas like that
           | one, just having a fluent native speaker to talk to as much
           | as you want would be incredibly valuable. Even more valuable
           | if they are smart/educated enough to act as a language
           | teacher. High-quality LLMs with a conversational interface
           | capable of seamless language switching are an absolute killer
           | app for language learning.
           | 
            | A use that seems scientifically possible but technically
            | difficult would be to have an LLM help you engage in
            | essentially immersion learning. Set up something like a
            | Pi-hole, but instead of cutting out ads it intercepts all
            | the content you're consuming (webpages, text, video,
            | images) and translates it into the language you're
            | learning. The idea is that you don't have to go out and
            | find whole new sources of language to set yourself up with
            | a different language's information ecosystem; you can just
            | press a button and convert your current information
            | ecosystem to the language you want to learn. If something
            | like that could be implemented it would be incredibly
            | valuable.
        
       | elzbardico wrote:
       | Really. I am not that impressed. It is not something radically
       | different from doing the same thing with a still photo which by
       | now is trivial for those models.
       | 
        | What is being tested here doesn't require a video. It doesn't
        | show an ability to derive any meaning from a short clip. It is
        | fucking doing very fancy OCR, that's all.
       | 
        | What would impress me is if, shown a clip of an open-chest
        | surgery, it was able to comment on what surgery is being done
        | and which technique is being used, or, shown video of
        | construction workers, it was able to figure out the building
        | technique and what they are actually doing, noting that the guy
        | in the yellow shirt is not following safety regulations by not
        | wearing a helmet.
        
       | AI_beffr wrote:
       | he calls this technology "exciting." it makes me shudder. i have
       | been contemplating this for a decade, this specific thing, and
       | now it really is right in front of us. what happens when the
       | useful data within any image or video stream can be extracted
       | into the form of text and descriptions? a model of the world or
       | of a country will emerge that you can hold in your hand. you can
       | know the exact whereabouts of anyone at any time. you can know
       | anything at any time. a real-time model of a country. and AI will
       | be able to digest this model and answer questions about it. any
       | government that has possession of such a system will wield
       | absolute control in a way that has never been possible before. it
       | will have massive implications. liberal democracy will no longer
        | be viable as an economic or political framework. jeff bezos once
       | said that we are essentially lucky that the most efficient way
       | for resources to be utilized is in a decentralized manner. the
       | fact that liberty is the strongest model economically, where
       | everyone acts independently, is a happy coincidence. centralized
       | economies, otherwise known as communism, haven't worked in the
       | past but that will change because with the power of AI, and with
       | the real-time model and control-loop that it will make possible,
       | the most efficient way to manage and deploy resources will be
       | with one central management entity. in other words, an advanced
       | AI will do literally everything for us, human labor will be made
       | worthless, and countries that stick to the old ways will simply
       | be made obsolete. inevitably, the AI-driven countries, with their
       | pathetic blobs of parasitic human enclaves hanging off their
       | tits, will move in on the old countries and destroy them for some
       | inane reason such as needing more space to store antimatter.
       | whatever.
       | 
       | even without looking all the way into the future, these AI video
       | and image digesting tools will give birth to new and horrifying
       | possibilities for bad actors in the government. their ability to
       | steamroll over people's lives in a bureaucratic stupor will be
       | completely out of control. this seems like a sure thing, but it
       | doesn't seem likely at all that AI will be proactively and bravely
       | used to counter-balance the negative uses by concerned citizens.
       | people need to open their eyes to the possibility that different
       | levels of technology are like points on a landscape -- not
       | necessarily getting better or worse with time or "progress."
        
         | elzbardico wrote:
         | Man. LLMs are basically auto-complete systems. This scenario
         | you're painting seems too far-fetched for this technology at
         | any timeline you could propose.
        
           | AI_beffr wrote:
           | just five years ago it would have been far-fetched to
           | suggest that we would have what we have now. it's clear that
           | people's intuitions about what is likely and what is not are
           | not accurate right now. and this scenario is actually the
           | opposite of unlikely; it's inevitable. the economic forces
           | will not allow any other outcome. it's not really surprising
           | when you consider how inefficient market-based economies
           | are, how inefficient and fragile humans are, and the fact
           | that communism has already come close to working in the
           | past. even without AI, centralized economies rival
           | decentralized ones. and the loss of human agency that comes
           | with centralized economies can't be dismissed.
        
       | waynesonfire wrote:
       | > It looks like the safety filter may have taken offense to the
       | word "Cocktail"!
       | 
       | how dare you!!! You are not allowed to think that.
       | 
       | It's crazy that we are witnessing a modern-day equivalent of
       | book burning / freedom of speech restrictions. Kind of a bummer.
       | I'm
       | not smart enough to argue freedom of speech and wish someone
       | smarter than me addressed this. Maybe I can ask chatgpt.
        
       | cubefox wrote:
       | So it is only about 256 tokens per image. I think the standard
       | text tokenization method encodes roughly two bytes per token,
       | using a vocabulary of around 65,536 different tokens. If the
       | same holds for images,
       | given that they have the same price in the API, that would be
       | just 512 bytes per image. Which seems impossibly low considering
       | that the AI is still able to read those book titles. I don't
       | understand what is going on here.
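To make that arithmetic concrete, here is a back-of-envelope sketch. It uses the comment's assumed ~65,536-entry vocabulary (two bytes per token), which is an assumption about the tokenizer, not a documented Gemini figure:

```python
# Back-of-envelope: raw information capacity of 258 tokens drawn
# from an assumed 65,536-entry (2-byte) vocabulary. This is the
# reasoning in the comment above, not documented Gemini internals.
import math

vocab_size = 2 ** 16                    # assumed number of possible tokens
tokens_per_image = 258                  # token count Gemini reports per image
bits_per_token = math.log2(vocab_size)  # 16 bits per token

capacity_bytes = int(tokens_per_image * bits_per_token / 8)
print(capacity_bytes)  # 516 bytes: far too little to store a raw image
```

The point of the calculation: 516 bytes cannot literally encode a legible photo of a bookshelf, which is why the low token count seems puzzling.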
        
       | loudmax wrote:
       | At the end of the article, a single image of the bookshelf
       | uploaded to Gemini is 258 tokens. Gemini then responds with a
       | listing of book titles, coming to 152 tokens.
       | 
       | Does anyone understand where the information for the response
       | came from? That is, does Gemini hold onto the original uploaded
       | non-tokenized image, then run an OCR on it to read those titles?
       | Or are all those book titles somehow contained in those 258
       | tokens?
       | 
       | If it's the latter, it seems amazing that these tokens contain
       | that much information.
        
         | simonw wrote:
         | I would LOVE to understand that myself.
        
         | jacobr1 wrote:
         | I'm not sure about Gemini, but OpenAI's GPT-4V bills at
         | roughly a token per 40x40px square. It isn't clear to me that
         | images are actually processed as such units; rather, it seems
         | like they tried to approximate the cost structure to match
         | text.
        
         | zacmps wrote:
         | Remember, if it's using a similar tokeniser to GPT-4
         | (cl100k_base iirc), each token has a dimension of ~100,000.
         | 
         | So 258x100,000 is a space of 25,800,000 values; using f16 (a
         | total guess) that's 51.6MB, more than enough to represent the
         | image at OK quality as a JPG.
        
           | simonw wrote:
           | I don't think that's right. A token in GPT-4 is a single
           | integer, not a vector of floats.
           | 
           | Input to a model gets embedded into vectors later, but the
           | actual tokens are pretty tiny.
        
       | Animats wrote:
       | Can you look at the tokens generated from an image?
        
       | andy_xor_andrew wrote:
       | > That 7 second video consumed just 1,841 tokens out of my
       | 1,048,576 token limit.
       | 
       | is this simply an approximation done by Gemini in order to add
       | some artificial limit on the amount of video?
       | 
       | Or do video frames actually equate directly to tokens somehow?
       | 
       | I guess my question is, is there a real relationship between
       | videos and tokens as we understand them (i.e. "hello" is a token)
       | or are they just using the term "tokens" because it's easy for a
       | user to understand, and an image is not literally handled the
       | same way a token is?
        
         | simonw wrote:
         | There's a new section at the bottom of the article about that.
         | 
         | It looks like an image is 258 tokens, and Gemini splits videos
         | into one frame per second and processes those as images.
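Those two observed numbers (258 tokens per image, one frame per second) suggest a simple accounting model. This is a sketch inferred from the article's measurements, not an official formula:

```python
# Rough model of Gemini 1.5 Pro's apparent video token accounting,
# assuming (per the article's observations, not official docs):
#   - video is sampled at 1 frame per second
#   - each frame costs about 258 tokens
TOKENS_PER_FRAME = 258
FRAMES_PER_SECOND = 1

def estimate_video_tokens(duration_seconds: int) -> int:
    """Estimate token usage for a video of the given length in seconds."""
    return duration_seconds * FRAMES_PER_SECOND * TOKENS_PER_FRAME

print(estimate_video_tokens(7))   # 1806, vs. the 1,841 observed
print(estimate_video_tokens(22))  # 5676, vs. the 6,049 observed
```

The estimates land slightly below the observed counts (1,841 and 6,049), with the gap presumably accounted for by the text prompt and other per-request overhead.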
        
       | msk-lywenn wrote:
       | I like how the article points out the token consumption at each
       | step. Do we have an idea of how much energy is actually used by
       | each token?
        
       | justinclift wrote:
       | Since it says the audio is stripped from the video before
       | processing, I wonder how well it'd do if asked to transcribe by
       | lip reading?
        
       | nostromo wrote:
       | > It looks like the safety filter may have taken offense to the
       | word "Cocktail"! I opened up the safety settings, dialled them
       | down to "low" for every category and tried again. It appeared to
       | refuse a second time.
       | 
       | Google really is its own worst enemy. Their risk management
       | people have completely taken over the organization to a point
       | where somehow the smartest computers ever created are afraid of
       | using dangerous words like "cocktail" or creating dangerous
       | images of people like "Abraham Lincoln."
        
       | Havoc wrote:
       | > 7 second video consumed just 1,841 tokens
       | 
       | How? Video is a massive amount of data
        
         | Alifatisk wrote:
         | I also wonder what the tokens consist of.
        
       | 2sk21 wrote:
       | Can someone give me a reference that describes exactly how
       | multimodal tokens are generated?
        
       | mberning wrote:
       | I can't wait for the closed source and NDA future of everything.
       | It's gonna suck.
        
       | Vicinity9635 wrote:
       | "So Google's new Gemini chatbot is racist as fuck."
       | 
       | https://twitter.com/JoshWalkos/status/1760423141942178037
        
         | lucubratory wrote:
         | This has nothing to do with the model's capabilities and isn't
         | substantially different from the vast majority of mainstream
         | values in content moderation on social media.
        
       | Uptrenda wrote:
       | This guy has zero diversity in his reading interests. Not a
       | single novel. Not a single book that isn't directly related to
       | software engineering. He's like a living parody of a HN meme
       | person. Talking to this guy would be like talking to wallpaper.
        
       | gmuslera wrote:
       | It may end being truly a killer app.
       | 
       | The sheer amount of video out there is already bad for privacy,
       | but making its processing orders of magnitude faster, easier,
       | and more scalable may greatly increase how much of it gets
       | processed, even if identifying what is in it isn't perfect. And
       | that by all kinds of actors, not just governments or
       | intelligence agencies.
       | 
       | Now match that with what is happening right now in Palestine, or
       | somewhere else in a not-so-distant future.
        
       ___________________________________________________________________
       (page generated 2024-02-21 23:00 UTC)