
Toward LLMs That Understand Misspellings

Researchers built a model that’s more robust to noisy inputs like misspellings, smarter about character-level information like the number of Rs in strawberry, and potentially better able to understand unfamiliar languages that might share groups of letters with familiar languages. Their approach: Eliminate the tokenizer and instead integrate a system that learns to group input characters.

What’s new: Artidoro Pagnoni, Ram Pasunuru, and collaborators at Meta, University of Washington, and University of Chicago introduced Byte Latent Transformer (BLT), a system of transformers that processes groups of text characters (in the form of bytes) directly.

Key insight: A tokenizer turns bytes (characters) into tokens (a word or part of a word) based on learned rules: Specific sequences map to particular tokens. A large language model (LLM) would be more efficient if its tokenizer considered how easy or difficult it would be to predict the next token, because then it could group tokens that commonly occur together, thus saving memory and processing power. For instance, to complete the phrase, “The capital of the United States is,” a tokenizer may generate “Washington”, then “D”, then “.C”, and finally “.” — even though it’s easy to predict that “D.C.” will follow “Washington” (that is, the number of viable options is very small). Conversely, generating the token after “D.C.” is harder, since many viable options exist. Using a small LLM to estimate the difficulty of predicting the next token enables the model to split difficult-to-predict text into smaller groups while packing easier-to-predict text into larger groups.
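To make this idea concrete, here is a minimal sketch of entropy-based grouping. It is not the authors’ code: the `next_byte_probs` callable (a stand-in for the small byte-level model) and the entropy threshold are hypothetical assumptions for illustration.

```python
# Minimal sketch of entropy-based grouping (an illustration, not the authors' code).
# `next_byte_probs` is a hypothetical stand-in for a small byte-level language model
# that returns a probability distribution over the 256 possible next-byte values.
import math

ENTROPY_THRESHOLD = 2.0  # hypothetical cutoff in bits; a real system would tune this


def entropy(probs):
    """Shannon entropy (in bits) of a next-byte distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def group_bytes(data: bytes, next_byte_probs) -> list:
    """Start a new group whenever the next byte is hard to predict (high entropy);
    otherwise keep extending the current group."""
    groups, current = [], []
    for i, b in enumerate(data):
        if current and entropy(next_byte_probs(data[:i])) > ENTROPY_THRESHOLD:
            groups.append(bytes(current))  # close the easy-to-predict run
            current = []
        current.append(b)
    if current:
        groups.append(bytes(current))
    return groups
```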

How it works: BLT comprises four transformers (8 billion parameters total): (i) a small byte-level transformer, (ii) an encoder transformer, (iii) a so-called latent transformer, and (iv) a decoder transformer. The authors trained the system to predict the next byte on 1 trillion tokens’ worth of text, including text drawn from a filtered version of Common Crawl.

  • The authors trained the byte-level transformer to generate the next byte from an input sequence of bytes.
  • For an input sequence, the byte-level transformer predicted a probability distribution over the value of the next byte. The authors used entropy, a measure of uncertainty, to decide how bytes should be grouped: if the predicted probabilities were concentrated on a particular byte value (low entropy), meaning the next byte was highly predictable, that byte was added to the current group; if the probabilities were spread across many byte values (high entropy), meaning the model was less certain, that byte started a new group.
  • The encoder transformer learned to represent each group as a vector, while attending to preceding bytes for context.
  • The latent transformer learned to generate the next group vector from all previous group vectors.
  • Finally, the decoder transformer learned to reconstruct a byte sequence from a sequence of group vectors. (A simplified data-flow sketch of the four stages appears below.)
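The skeleton below summarizes the pipeline described in the list above in heavily simplified form. The callables, their signatures, and the `Vector` type are placeholder assumptions made for illustration, not Meta’s implementation.

```python
# Simplified data-flow skeleton of the four-stage pipeline described above.
# All callables are placeholders (assumptions for illustration), not the BLT code.
from typing import Callable, List

Vector = List[float]                              # stand-in for a model embedding

Grouper = Callable[[bytes], List[bytes]]          # entropy-based grouping (see earlier sketch)
Encoder = Callable[[List[bytes]], List[Vector]]   # one vector per byte group
Latent = Callable[[List[Vector]], Vector]         # next group vector from previous ones
Decoder = Callable[[Vector], bytes]               # group vector back to bytes


def generate_next_group(prompt: bytes, group: Grouper, encode: Encoder,
                        predict: Latent, decode: Decoder) -> bytes:
    groups = group(prompt)       # 1. split the prompt into variable-size byte groups
    vectors = encode(groups)     # 2. encoder represents each group as a vector
    next_vec = predict(vectors)  # 3. latent transformer predicts the next group vector
    return decode(next_vec)      # 4. decoder reconstructs the next group's bytes
```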

Results: On seven benchmarks that test general language and coding abilities, BLT achieved an average accuracy of 61.1 percent, outperforming Llama 3 (8 billion parameters and a similar number of floating point operations to BLT) at 60.0 percent.

  • BLT achieved 80.6 percent on the common-sense question-answering benchmark HellaSwag, while Llama 3 achieved 79.1 percent.
  • BLT demonstrated significantly higher resilience to noisy inputs compared to Llama 3, particularly in tasks involving character manipulation, spelling variations, and languages for which relatively little data is available. For example, in the CUTE spelling benchmark, which tests a model’s ability to recognize correctly spelled words, BLT achieved 99.9 percent accuracy while Llama 3 achieved 1.1 percent accuracy.
  • BLT outperformed Llama 3 in translating to English across 26 languages (including 20 with little data). It achieved an average SentencePiece BLEU score of 14.0 (BLEU measures how closely a machine translation matches a human translation, here computed over text tokenized with the SentencePiece tokenizer), while Llama 3 achieved an average of 12.1.

Why it matters: By working directly on bytes, BLT is inherently more robust to variations in language, which improves its performance. For instance, when prompted to insert a "z" after every "n" in "not", Llama 3 incorrectly completed it as "znotz". This happened because its tokenizer treats "not" as a single, indivisible token. In contrast, BLT correctly generated "nzot," because it can dynamically regroup bytes and draw new boundaries. In a more practical case, instead of treating "pizya" and "pizza" as different tokens, BLT recognizes that they share nearly identical byte sequences, differing only in the bytes for "y" and "z", and therefore likely mean the same thing.
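The “pizya”/“pizza” point is easy to verify at the byte level. The toy snippet below (not from the paper) simply compares the two UTF-8 byte sequences to show that they differ at a single position.

```python
# Toy illustration: at the byte level, "pizza" and "pizya" share almost everything,
# whereas a subword tokenizer may map them to entirely different token sequences.
a, b = "pizza".encode("utf-8"), "pizya".encode("utf-8")
print(list(a))  # [112, 105, 122, 122, 97]
print(list(b))  # [112, 105, 122, 121, 97]
diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
print(diff)     # [3] -- only the 'z'/'y' byte differs
```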

We’re thinking: In some alternatives to traditional tokenization, an LLM might process much longer sequences because the number of bytes in a sentence is much larger than the number of words. This work addresses that issue by grouping bytes dynamically. The tradeoff is complexity: Instead of one transformer, we have four.
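For a rough sense of that length tradeoff, the back-of-the-envelope snippet below compares raw byte count, word count, and group count for one sentence. The average group size is a made-up assumption for illustration, not a figure reported in this article.

```python
# Back-of-the-envelope illustration of the sequence-length tradeoff discussed above.
sentence = "The capital of the United States is Washington D.C."
n_bytes = len(sentence.encode("utf-8"))      # byte-level sequence length (51)
n_words = len(sentence.split())              # word-level length (9)
assumed_avg_group_size = 4                   # hypothetical average bytes per group
n_groups = n_bytes / assumed_avg_group_size  # dynamic grouping lands in between (12.75)
print(n_bytes, n_words, n_groups)
```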
