js8 4 hours ago

Not pixels, but percels. Pixels are points in the image, while a "percel" is a unit of perceptual information. It might be a pixel with an associated sound, in a given moment of time. In the case of humans, percels include other senses as well, and they can also be annotated with your own thoughts (i.e. percels can also include tokens or embeddings).

Of course, NNs like LLMs never process a percel in isolation, but always as a group of neighboring percels (aka context), with an initial focus on one of the percels.

tcdent 9 hours ago

"Kill the tokenizer" is such a wild proposition but is also founded in fundamentals.

Tokenizing text is such a hack even though it works pretty well. The state-of-the-art comes out of the gate with an approximation for quantifying language that's wrong on so many levels.

It's difficult to wrap my head around pixels being a more powerful representation of information, but someone's gotta come up with something other than the tokenizer.

  • dgently7 8 hours ago

    As a vision-capable person I consume all text as images when I read, so it kinda passes the "evolution does it that way" test, and maybe we shouldn’t be that surprised that vision is a great input method?

    Actually, thinking more about that, I consume “text” as images and also as sounds… I kinda wonder: instead of render-and-OCR like this suggests, if we did TTS and just encoded, say, the mp3 sample of the vocalization of the word, would that be fewer bytes than the rendered-pixels version? Probably depends on the resolution / sample rate.

    • visarga 7 hours ago

      Funny, I habitually read while engaging TTS on the same text. I have even made a Chrome extension for web reading: it highlights text and reads it, while keeping the current position in the viewport. I find using 2 modalities at the same time improves my concentration. TTS is sped up to 1.5x to match reading speed. Maybe it is just because I want to reduce visual strain. Since I consume a lot of text every day, it can be tiring.

      • gavinray an hour ago

        Any chance you could share the source?

        I found that I can read better if individual words or chunks are highlighted in alternating pastel colors while I scan them with my eyes.

      • Version467 5 hours ago

        I do this too. It's great. The term I've seen used to describe this is 'Immersion Reading'. It seems to be quite a popular way for neurodivergent people to get into reading.

      • lukevp 6 hours ago

        What’s your extension? Sounds interesting!

        • zirror 4 hours ago

          Just FYI, Firefox reader mode does the same thing. It's a little button in the address bar.

    • psadri 6 hours ago

      The pixels-to-sounds conversion would pass through “reading”, so there might be information loss. It is no longer just pixels.

  • Tarq0n 42 minutes ago

    Ok but what are you going to decode into at generation time, a jpeg of text? Tokens have value beyond how text appears to the eye, because we process text in many more ways than just reading it.

  • ReptileMan an hour ago

    I guess it is because of the absurdly high information density of text - so text is quite a good input.

a_bonobo 4 hours ago

Somewhat related:

There's this older paper from Lex Flagel and others where they transform DNA-based text, stuff we'd normally analyse via text files, into images and then train CNNs on the images. They managed to get the CNNs to re-predict population genetics measurements we normally get from the text-based DNA alignments.

https://academic.oup.com/mbe/article/36/2/220/5229930

orliesaurus 8 hours ago

One of the most interesting aspects of the recent discussion on this topic is how it underscores our reliance on lossy abstractions when representing language for machines. Tokenization is one such abstraction, but it's not the only one: using raw pixels or speech signals is a different kind of approximation. What excites me about experiments like this is not so much that we'll all be handing images to language models tomorrow, but that researchers are pressure-testing the design assumptions of current architectures. Approaches that learn to align multiple modalities might reveal better latent structures or training regimes, and that could trickle back into more efficient text encoders without throwing away a century of orthography. But there’s also a rich vein to mine in scripts and languages that don’t segment neatly into words: alternative encodings might help models handle those better.

bni 7 hours ago

Of course PowerPoint is the best input to LLMs. They will come to that eventually.

  • jtwaleson 6 hours ago

    It's slides all the way down. Once models support this natively, it's a major threat to slides ai / gamma and the careers of product managers.

  • brokencode 6 hours ago

    I'd actually prefer to communicate to ChatGPT via Microsoft Paint. Much more efficient than typing.

    • saaaaaam 2 hours ago

      Leading scientists claim interpretative dance is the AI breakthrough the world has been waiting for!

  • amelius 2 hours ago

    Clippy knew this all along.

  • cat5e 6 hours ago

    Yeah, I’ve seen great results with this approach.

nl 9 hours ago

Karpathy's points are correct (of course).

One thing I like about text tokens though is that the model learns some understanding of the text input method (particularly the QWERTY keyboard).

"Hello" and "Hwllo" are closer in semantic space than you'd think because "w" and "e" are next to each other.

This is much easier to see in hand-coded spelling models, where you can get better results by including a "keyboard distance" metric along with a string distance metric.
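
A rough sketch of what such a metric could look like (illustrative only, not from any particular spell checker; the layout table and cost scaling here are made up):

  # Hypothetical keyboard-aware edit distance (illustrative sketch)
  QWERTY_ROWS = ["qwertyuiop", "asdfghjkl", "zxcvbnm"]
  KEY_POS = {ch: (r, i) for r, row in enumerate(QWERTY_ROWS) for i, ch in enumerate(row)}

  def key_distance(a: str, b: str) -> float:
      """Euclidean distance between two keys on a QWERTY grid."""
      (r1, c1), (r2, c2) = KEY_POS[a], KEY_POS[b]
      return ((r1 - r2) ** 2 + (c1 - c2) ** 2) ** 0.5

  def keyboard_edit_distance(s: str, t: str) -> float:
      """Levenshtein distance where substitution cost scales with key distance."""
      m, n = len(s), len(t)
      d = [[0.0] * (n + 1) for _ in range(m + 1)]
      for i in range(m + 1):
          d[i][0] = float(i)
      for j in range(n + 1):
          d[0][j] = float(j)
      for i in range(1, m + 1):
          for j in range(1, n + 1):
              sub = 0.0 if s[i - 1] == t[j - 1] else min(1.0, key_distance(s[i - 1], t[j - 1]) / 3)
              d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
      return d[m][n]

  # "hwllo" (w is next to e) scores closer to "hello" than "hxllo" does
  print(keyboard_edit_distance("hwllo", "hello"))
  print(keyboard_edit_distance("hxllo", "hello"))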

  • harperlee 5 hours ago

    But assuming that pixel input gets us to an AI capable of reading, it would presumably also be able to detect HWLLO as semantically close to HELLO (similarly to H3LL0, or badly handwritten text - although there would be some graphical structure in those latter examples to help). At the end of the day we are capable of identifying that... It might require some more training effort, but the result would be more general.

  • swyx 9 hours ago

    im particularly sympathetic to typo learning, which i think gets lost in the synthetic data discussion (mine here https://www.youtube.com/watch?v=yXPPcBlcF8U )

    but i think in this case you can still generate typos in images and it'd be learnable. not a hard issue relevant to the OP

koushikn 3 hours ago

Is it feasible that if we have a tokeniser that works on ELF (or PE/COFF) binaries, then we could have LLMs trained on existing binaries and have them generate binary code directly, skipping the need for programming languages?

  • anon291 an hour ago

    Possible, but how precise it is depends on your use case. LLM compilers would suffer from the same sort of propensity for bugs as humans.

sabareesh 13 hours ago

It might be that our current tokenization is inefficient compared to how well the image pipeline does. Language already does a lot of compression, but there might be an even better way to represent it in latent space.

  • ACCount37 13 hours ago

    People in the industry know that tokenizers suck and there's room to do better. But actually doing it better? At scale? Now that's hard.

    • typpilol 12 hours ago

      It will require like 20x the compute

      • ACCount37 11 hours ago

        A lot of cool things are shot down by "it requires more compute, and by a lot, and we're already compute starved on any day of the week that ends in y, so, not worth it".

        If we had a million times the compute? We might have brute forced our way to AGI by now.

        • Jensson 10 hours ago

          But we don't have a million times the compute, we have the compute we have, so it's fair to argue that we want to prioritize other things.

      • kenjackson 11 hours ago

        Why so much compute? Can you tie it to the problem?

        • typpilol 4 hours ago

          Tokenizers are the reason LLMs are even possible to run at a decent speed on our best hardware.

          Removing the tokenizer would cut the context to 1/4 and 4x the compute and memory, assuming an average token length of 4 characters.

          Also, you would probably need 4x the parameters, since the model has to learn relationships between individual characters as well as words and sentences, etc.

          There have been a few studies on small models; even then, those only show a tiny percentage gain over tokenized models.

          So essentially you would need 4x the compute, 1/4 the context, and 4x the parameters to squeeze 2-4% more performance out of it.

          And that falls apart when you need more than 1/4 of the context. So realistically you need to support the same context, and your compute goes up another 4x, to 16x.

          That's why.
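
          A back-of-envelope version of that arithmetic (my own illustrative numbers, assuming ~4 characters per token and roughly quadratic attention cost):

            # Rough scaling estimate: character-level vs. token-level input
            chars_per_token = 4                       # assumed average token length
            seq_tokens = 8_000                        # example context, in tokens
            seq_chars = seq_tokens * chars_per_token  # same text at character level

            # Attention cost grows roughly quadratically with sequence length
            attention_blowup = (seq_chars / seq_tokens) ** 2   # 16x
            # Per-position FFN cost grows roughly linearly with sequence length
            linear_blowup = seq_chars / seq_tokens             # 4x

            print(f"attention: ~{attention_blowup:.0f}x, linear layers: ~{linear_blowup:.0f}x")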

      • Mehvix 11 hours ago

        Why do you suppose this is a compute limited problem?

        • ACCount37 10 hours ago

          It's kind of a shortcut answer by now. Especially for anything that touches pretraining.

          "Why aren't we doing X?", where X is a thing that sounds sensible, seems like it would help, and does indeed help, and there's even a paper here proving that it helps.

          The answer is: check the paper, it says there on page 12 in a throwaway line that they used 3 times the compute for the new method than for the controls. And the gain was +4%.

          A lot of promising things are resource hogs, and there are too many better things to burn the GPU-hours on.

          • typpilol 4 hours ago

            Thanks.

            Also, saying it needs 20x the compute is exactly that kind of answer. It's something we could do eventually, but not now.

  • CuriouslyC 13 hours ago

    Image models use "larger" tokens. You can get this effect with text tokens if you use a larger token dictionary and generate common n-gram tokens, but the current LLM architecture isn't friendly to large output distributions.
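
    For illustration, here's a toy version of that n-gram idea (not how production tokenizers are trained, just the gist: greedily promote the most common adjacent pair to a new, larger token):

      from collections import Counter

      def merge_pair(word, a, b):
          out, i = [], 0
          while i < len(word):
              if i + 1 < len(word) and word[i] == a and word[i + 1] == b:
                  out.append(a + b)
                  i += 2
              else:
                  out.append(word[i])
                  i += 1
          return out

      def grow_vocab(corpus, merges):
          """Greedily add the most common adjacent pair as a new 'larger' token."""
          tokens = [list(w) for w in corpus]
          vocab = set(ch for w in tokens for ch in w)
          for _ in range(merges):
              pairs = Counter((w[i], w[i + 1]) for w in tokens for i in range(len(w) - 1))
              if not pairs:
                  break
              (a, b), _ = pairs.most_common(1)[0]
              vocab.add(a + b)
              tokens = [merge_pair(w, a, b) for w in tokens]
          return vocab, tokens

      vocab, toks = grow_vocab(["the", "then", "there", "other"], merges=4)
      print(sorted(vocab))  # now includes multi-character tokens like "th", "the"
      print(toks)           # fewer, larger tokens per word; the output distribution grows with the vocab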

    • yorwba 8 hours ago

      You don't have to use the same token dictionary for input and output. There are things like simultaneously predicting multiple tokens ahead as an auxiliary loss and for speculative decoding, where the output is larger than the input, and similarly you could have a model where the input tokens combine multiple output tokens. You would still need to do a forward pass per output token during autoregressive generation, but prefill would require fewer passes and the KV cache would be smaller too, so it could still produce a decent speedup.

      But in the DeepSeek-OCR paper, compressing more text into the same number of visual input tokens leads to progressively worse output precision, so it's not a free lunch but a speed-quality tradeoff, and more fine-grained KV cache-compression methods might deliver better speedups without degrading the output as much.

    • mark_l_watson 8 hours ago

      Interesting idea! Haven’t heard that before.

bob1029 2 hours ago

I think the DCT is a compelling way to interact with spatial information when the channel is constrained. What works for JPEG can likely work elsewhere. The energy compaction properties of the DCT mean you get most of the important information in a few coefficients. A quantizer can zero out everything else. Zig-zag-scanned + RLE byte sequences could be a reasonable way to generate useful "tokens" from transformed image blocks. Take everything from the JPEG encoder except perhaps the entropy coding step.

At some level you do need something approximating a token. BPE is very compelling for UTF-8 sequences; it might be nearly the ideal way to transform (compress) that kind of data. For images, audio and video, we need some kind of grain like that. Something to reorganize the problem and dramatically reduce the information rate to a point where it can be managed. Compression and entropy are at the heart of all of this. I think BPE is doing more heavy lifting than we are giving it credit for.

I'd extend this thinking to techniques like MPEG for video. All frame types use something like the DCT too. The P and B frames are basically the same idea as the I frame (JPEG); the difference is they take the DCT of the residual between adjacent frames. This is where the compression gets to be insane with video. It's block transforms all the way down.

An 8x8 DCT block for a channel of SDR content is 512 bits of raw information. After quantization and RLE (for typical quality settings), we can get this down to 50-100 bits of information. I feel like this is an extremely reasonable grain to work with.
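
To make the grain concrete, here's a rough sketch of that pipeline (illustrative only; the quantizer is a single made-up scale factor rather than a real JPEG quality table):

  import numpy as np

  def dct2(block):
      """2D DCT-II of an 8x8 block via the orthonormal DCT matrix."""
      n = 8
      k = np.arange(n)
      c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
      c[0, :] = np.sqrt(1.0 / n)
      return c @ block @ c.T

  def zigzag(block):
      """Zig-zag scan an 8x8 block into a 64-element sequence."""
      idx = sorted(((i, j) for i in range(8) for j in range(8)),
                   key=lambda p: (p[0] + p[1], -p[1] if (p[0] + p[1]) % 2 else p[1]))
      return [block[i, j] for i, j in idx]

  def block_to_token(block, q=16.0):
      """Quantize DCT coefficients, then run-length encode the zig-zag sequence."""
      coeffs = np.round(dct2(block.astype(float) - 128.0) / q).astype(int)
      out, run = [], 0
      for v in zigzag(coeffs):
          if v == 0:
              run += 1
          else:
              out.append((run, v))  # (length of zero run, nonzero level)
              run = 0
      return out  # short, mostly low-frequency "token" for this block

  block = np.random.randint(0, 256, (8, 8))
  print(block_to_token(block))  # a handful of (run, level) pairs instead of 64 raw samples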

  • jacquesm an hour ago

    I can listen to music in my head. I don't think this is an extraordinary property but it is kind of neat. That hints at the fact that I somehow must have encoded this music. I can't imagine I'm storing the equivalent of a MIDI file, but I also can't imagine that I'm storing raw audio samples because there is just too much of it.

    It seems to work for vocals as well, not just short samples but entire works. At least that's what I think; there is a pretty good chance they're not 'entire', but it's enough that it isn't just excerpts, and if I were a good enough musician I could replicate what I remember.

    Is there anybody that has a handle on how we store auditory content in our memories? Is it a higher level encoding or a lower level one? This capability is probably key in language development so it is not surprising that we should have the capability to encode (and replay) audio content, I'm just curious about how it works, what kind of accuracy is normally expected and how much of such storage we have.

    Another interesting thing is that it is possible to search through it fairly rapidly to match a fragment heard to one that I've heard and stored before.

shikon7 8 hours ago

Seems we're now at the point where OCR is doing so well that printing text out and letting computers literally read it is suggested to be superior to processing the encoded text directly.

  • Legend2440 4 hours ago

    Neural networks have essentially solved perception. It doesn't matter what format your data comes in, as long as you have enough of it to learn the patterns.

  • programmarchy 8 hours ago

    PDF is arguably a confusing format for LLMs to read.

hunglee2 5 hours ago

Really interesting analysis of the latest DeepSeek innovation. I’m tempted to connect it to the information density of logographic script, in which DeepSeek engineers would all be natively fluent.

ianbutler 11 hours ago

https://arxiv.org/abs/2510.17800 (Glyph: Scaling Context Windows via Visual-Text Compression)

You can also see this paper from the GLM team, where they explicitly test this assumption with some pretty good results.

  • scotty79 9 hours ago

    I couldn't imagine how rendering text tokens to images could bring any savings, but then I remembered each token is converted into hundreds of floating-point numbers before being fed to the neural network. So in a way it's already rendered into a multidimensional pixel (or hundreds of arbitrary 2-dimensional pixels). This paper shows that you don't need that many numbers to keep the accuracy, and that using numbers that represent the text visually (which is pretty chaotic) is just as good as the way we currently do it.

bahmboo 2 hours ago

Not criticizing per se, but I just watched this recent (and great!) interview where he extols how special written language is. That was my takeaway at least. Still trying to wrap my head around this vision encoder approach. He’s way smarter than me! https://youtu.be/lXUZvyajciY

antirez 4 hours ago

This should be "pixels are (maybe) a better representation than the current representation of tokens". Which is very different. Text is surely more information dense than the image containing the same text, so the problem is finding the best representation of text. If each word is expanded to a very large embedding and you see pixels doing better, then the problem is in the representation and not in text vs image.

nottorp an hour ago

The text should be printed and a photo of the printed paper on a wooden table should be passed as input into the LLM.

seydor 5 hours ago

We're going to get closer and closer to removing all hand-engineered features of neural network architecture, and letting a giant all-to-all fully connected network collapse on its own to the appropriate architecture for the data: a true black box.

hbarka 13 hours ago

Chinese writing is logographic. Could this be giving Chinese developers a better intuition for pixels as input rather than text?

  • anabis 11 hours ago

    Yeah, mapping Chinese characters to linear UTF-8 space throws a lot of information away. Each language brings some ideas for text processing. The inventor of SentencePiece is Japanese, for example, and Japanese doesn't have explicit word delimiters.

  • hobofan 5 hours ago

    Yeah, that sounds quite interesting. I'm wondering whether there is a bigger gap in performance (= quality) between text-only <-> vision OCR in Chinese than in English.

    There is indeed a lot of semantic information contained in the signs that should help an LLM. E.g. there is a clear visual connection between 木 (wood/tree) and 林 (forest), while an LLM that purely has to draw a connection between "tree" and "forest" would have a much harder time seeing that connection independent of whether it's fed that as text or vision tokens.

  • est 4 hours ago

    Chinese text == Method of loci

    Many Chinese students have good enough memory to recall a particular paragraph and understand the meaning, but no idea how those words are pronounced.

alexchamberlain 5 hours ago

I'm probably one of the least educated software engineers on LLMs, so apologies if this is a very naive question. Has anyone done any research into just using words as the tokens rather than (if I understand it correctly) 2-3 characters? I understand there would be limitations with this approach, but maybe the models would be smaller overall?

  • murkt 5 hours ago

    You will need dictionaries with millions of tokens, which will make models much larger. Also, any word that has too low frequency to appear in the dictionary is now completely unknown to your model.

  • mhuffman 5 hours ago

    Along with the other commenter's point, the reason the dictionary would get so big is that every word stem would have all of its variations as different tokens (cat, cats, sit, sitting, etc.). Also, any out-of-dictionary words or compound words, e.g. "cat bed", would not be able to be addressed.
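
    A tiny illustration of the out-of-vocabulary problem with word-level tokens (toy vocab, purely hypothetical, not a real tokenizer):

      # Word-level vocab built from a toy "training corpus"
      train = "the cat sat on the mat".split()
      vocab = {w: i for i, w in enumerate(dict.fromkeys(train))}
      UNK = len(vocab)  # single id for every unseen word

      def encode(text):
          # Any unseen surface form ("cats", "bed") collapses to <unk>
          return [vocab.get(w, UNK) for w in text.split()]

      print(encode("the cat sat"))                # all known
      print(encode("the cats sat on a cat bed"))  # "cats", "a", "bed" -> <unk>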

pcwelder 5 hours ago

There are many Unicode characters that look alike. There are also those zero-width characters.

anon291 an hour ago

I made exactly this point at the inaugural Portland AI Tinkerers meetup. I had been messing with large document understanding. Converting PDF to text and then sending it to GPT was too expensive. It was cheaper to just upload the image and ask it questions directly. And about as accurate.

https://portland.aitinkerers.org/talks/rsvp_fGAlJQAvWUA

cnxhk 11 hours ago

The paper is quite interesting, but efficiency on OCR tasks does not mean it could be plugged into a general LLM directly without performance loss. If you train a tokenizer only on OCR text, you might be able to get better compression already.

yunwal 2 days ago

> The more interesting part for me (esp as a computer vision at heart who is temporarily masquerading as a natural language person) is whether pixels are better inputs to LLMs than text. Whether text tokens are wasteful and just terrible, at the input.

> Maybe it makes more sense that all inputs to LLMs should only ever be images.

So, what, every time I want to ask an LLM a question I paint a picture? I mean at that point why not just say "all input to LLMs should be embeddings"?

  • fspeech 13 hours ago

    If you can read your input on your screen, your computer apparently knows how to convert your text to images.

  • smegma2 2 days ago

    No? He’s talking about rendered text

    • rhdunn 13 hours ago

      From the post he's referring to text input as well:

      > Maybe it makes more sense that all inputs to LLMs should only ever be images. Even if you happen to have pure text input, maybe you'd prefer to render it and then feed that in:

      Italicized emphasis mine.

      So he's suggesting, or at least wondering, whether the vision model should be the only input to the LLM and have it read the text. So there would be a rasterization step on the text input to generate the image.

      Thus, you don't need to draw a picture; you just generate a raster of the text and feed it to the vision model.
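
      Roughly, the rasterization step could look like this (a minimal sketch using PIL; font choice and layout are placeholders):

        from PIL import Image, ImageDraw, ImageFont

        def rasterize(text, width=640, line_h=14):
            """Render plain text to a grayscale image that a vision encoder could consume."""
            font = ImageFont.load_default()  # stand-in; a real pipeline would pick a font/size deliberately
            lines = text.split("\n")
            img = Image.new("L", (width, line_h * (len(lines) + 1)), color=255)
            draw = ImageDraw.Draw(img)
            for i, line in enumerate(lines):
                draw.text((4, 4 + i * line_h), line, fill=0, font=font)
            return img

        # The raster, not the string, is what would be fed to the vision encoder.
        rasterize("Maybe all inputs to LLMs should only ever be images.").save("prompt.png")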

  • CuriouslyC 13 hours ago

    All inputs being embeddings can work if you have something like Matryoshka embeddings; the hard part is adaptively selecting the embedding size for a given datum.

  • awesome_dude 7 hours ago

    I mean, text is, after all, highly stylised images

    It's trivial for text to be pasted in and converted to pixels (that's what my computer, and every computer on the planet, does when showing me text)

varispeed 13 hours ago

Text is linear, whereas an image is parallel. I mean, when people read, they often don't scan text from left to right (or a different direction, depending on the language), but rather read the text all at once or non-linearly. Like first locking on keywords and then reading adjacent words to get the meaning, often even skipping some filler sentences unconsciously.

Sequential reading of text is very inefficient.

  • sosodev 12 hours ago

    LLMs don't "read" text sequentially, right?

    • olliepro 12 hours ago

      The causal masking means future tokens don’t affect previous tokens' embeddings as they evolve throughout the model, but all tokens are processed in parallel… so, yes and no. See this previous HN post (https://news.ycombinator.com/item?id=45644328) about how bidirectional encoders are similar to diffusion’s non-linear way of generating text. Vision transformers use bidirectional encoding b/c of the non-causal nature of image pixels.
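
      To make the "yes and no" concrete, a small sketch of the masking difference (numpy, illustrative shapes only):

        import numpy as np

        T = 4                           # sequence length
        scores = np.random.randn(T, T)  # raw attention scores (query x key)

        def softmax(x):
            e = np.exp(x - x.max(-1, keepdims=True))
            return e / e.sum(-1, keepdims=True)

        # Causal (decoder-style): position i only attends to positions <= i
        causal_mask = np.triu(np.ones((T, T), dtype=bool), k=1)
        causal_attn = softmax(np.where(causal_mask, -np.inf, scores))

        # Bidirectional (ViT/encoder-style): every position attends to every position
        full_attn = softmax(scores)

        print(causal_attn)  # lower-triangular pattern: no weight on future positions
        print(full_attn)    # dense pattern: all positions visible, as with image patches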

      • Merik 11 hours ago

        Didn’t Anthropic show that the models engage in a form of planning, such that they predict possible future tokens which then affect prediction of the next token: https://transformer-circuits.pub/2025/attribution-graphs/bio...

        • ACCount37 10 hours ago

          Sure, an LLM can start "preparing" for token N+4 at token N. But that doesn't change that the token N can't "see" N+1.

          Causality is enforced in LLMs - past tokens can affect future tokens, but not the other way around.

    • anon291 an hour ago

      If the attention is masked, then yes they do.

  • jb1991 9 hours ago

    I think you’re making a lot of assumptions about how people read.

    • com2kid 7 hours ago

      He isn't, plenty of studies have been done on the topic. Eyes dart around a lot when reading.

      • jb1991 6 hours ago

        People do skip words or scan for key phrases, but reading still happens in sequence. The brain depends on word order and syntax to make sense of text, so you cannot truly read it all at once. Skimming just means you sample parts of a linear structure, not that reading itself is non-linear. Eye-tracking studies confirm this sequential processing (check out the Rayner study in Psychological Bulletin if you are interested).

        • com2kid 42 minutes ago

          Thanks for the reference!

          Reading is def not 100% linear, as I find myself skipping ahead to see who is talking or what type of sentence I am reading (question, exclamation, statement).

          There is an interesting discussion down thread about ADHD and sequential reading. As someone who has ADHD I may be biased by how my brain works. I definitely don't read strictly linearly, there is a lot of jumping around and assembling of text.

  • spiralcoaster 11 hours ago

    What people do you know that do this? I absolutely read in a linear fashion unless I'm deliberately skimming something to get the gist of it. Who can read the text "all at once"?!

    • numpad0 10 hours ago

      I don't know how common it is, but I tend to read novels in a buttered heterogeneous multithreading mode - image and logical and emotional readings all go at each their own paces, rather than a singular OCR engine feeding them all with 1D text

      is that crazy? I'm not buying it is

      • alwa 8 hours ago

        That description feels relatable to me. Maybe buffered more than buttered, in my case ;)

        It seems to me that would be a tick in the “pro” column for this idea of using pixels (or contours, a la JPEG) as the models’ fundamental stimulus to train against (as opposed to textual tokens). Isn’t there a comparison to be drawn between the “threads” you describe here, and the multi-headed attention mechanisms (or whatever it is) that the LLM models use to weigh associations at various distances between tokens?

      • bigbluedots 10 hours ago

        Don't know, probably? I'm a linear reader

    • ants_everywhere 9 hours ago

      I do this. I'm autistic and have ADHD so I'm not representative of the normal person. However, I don't think this is entirely uncommon.

      The relevant technical term is "saccade"

      > ADHD: Studies have shown a consistent reduction in ability to suppress unwanted saccades, suggesting an impaired functioning of areas like the dorsolateral prefrontal cortex.

      > Autism: An elevated number of antisaccade errors has been consistently reported, which may be due to disturbances in frontal cortical areas.

      https://eyewiki.org/Saccade

      Also see https://en.wikipedia.org/wiki/Eye_movement_in_reading

      • alwa 9 hours ago

        I do this too. I suspect it may involve a subtly different mechanism from the saccade itself though? If the saccade is the behavior, and per the eyewiki link skimming is a voluntary type of saccade, there’s still the question of what leads me to use that behavior when I read (and others to read more linearly). Although you could certainly watch my eyes “saccade” around as I move nonlinearly through a passage, I’m not sure it’s out of a lack of control.

        Rather, I feel like I absorb written meaning in units closer to paragraphs than to words or sentences. I’d describe my rapid up-and-down, back-and-forth eye motions as something closer to going back to soak up more, if that makes sense. To reinterpret it in the context of what came after it. The analogy that comes to mind is to a Progressive JPEG getting crisper as more loads.

        That eyewiki entry was really cool. Among the unexpectedly interesting bits:

        > The initiation of a saccade takes about 200 milliseconds[4]. Saccades are said to be ballistic because the movements are predetermined at initiation, and the saccade generating system cannot respond to subsequent changes in the position of the target after saccade initiation[4].

  • ants_everywhere 9 hours ago

    some of us with ADHD just kind of read all the words at once

hiddencost 9 hours ago

Back before transformers, or even LSTMs, we used to joke that image recognition was so far ahead of language modeling that we should just convert our text to PDF and run the pixels through a CNN.

dgfitz 10 hours ago

[flagged]

  • scotty79 9 hours ago

    It's kind of beautiful that they can actually do that.