Tokenisation is NP-Complete

  • 2024-12-19 18:59:46
  • Philip Whittington, Gregor Bachmann, Tiago Pimentel

Abstract

In this work, we prove the NP-completeness of two variants of tokenisation, defined as the problem of compressing a dataset to at most $\delta$ symbols by either finding a vocabulary directly (direct tokenisation), or selecting a sequence of merge operations (bottom-up tokenisation).
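
To make the bottom-up variant concrete, the following is a minimal sketch of how a candidate sequence of merges could be checked against the budget $\delta$. It assumes a character-level starting alphabet and left-to-right, non-overlapping application of each merge; these conventions, along with the names `apply_merge`, `compressed_length`, and `meets_budget`, are illustrative assumptions and are not taken from the paper. The check itself runs in polynomial time, which is the easy half of an NP-completeness result; the hard part is finding a merge sequence that passes it.

```python
from typing import List, Tuple

Merge = Tuple[str, str]


def apply_merge(tokens: List[str], pair: Merge) -> List[str]:
    """Replace each left-to-right, non-overlapping occurrence of `pair`
    with a single merged symbol (an assumed merge convention)."""
    merged: List[str] = []
    i = 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def compressed_length(dataset: List[str], merges: List[Merge]) -> int:
    """Total number of symbols after applying `merges` in order to every string."""
    total = 0
    for text in dataset:
        tokens = list(text)  # assumption: start from single characters
        for pair in merges:
            tokens = apply_merge(tokens, pair)
        total += len(tokens)
    return total


def meets_budget(dataset: List[str], merges: List[Merge], delta: int) -> bool:
    """Check whether the merge sequence compresses the dataset to at most `delta` symbols."""
    return compressed_length(dataset, merges) <= delta


if __name__ == "__main__":
    dataset = ["abab", "abc"]                # toy dataset, hypothetical
    merges = [("a", "b"), ("ab", "ab")]      # candidate merge sequence
    print(compressed_length(dataset, []))    # 7 symbols with no merges
    print(compressed_length(dataset, merges))  # 3 symbols after both merges
    print(meets_budget(dataset, merges, delta=3))
```

In this toy instance the two merges reduce the dataset from 7 symbols to 3, so the sequence satisfies a budget of $\delta = 3$ but not $\delta = 2$.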