• GenderNeutralBro@lemmy.sdf.org
    link
    fedilink
    English
    arrow-up
    5
    ·
    7 months ago

    That’s somewhat awkward phrasing but I think the visual processing will also be done on-device. There are a few small multimodal models out there. Mozilla’s llamafile project includes multimodal support, so you can query a language model about the contents of an image.

    Even just a few months ago I would have thought this was not viable, but the newer models are game-changingly good at very small sizes. Small enough to run on any decent laptop or even a phone.