Tokenization in NLP¶

Breaking text into meaningful units

What is Tokenization?¶

Tokenization is the process of breaking text into smaller units called tokens.

We will take a quick look at the following approaches. The goal is to get you to appreciate the complexity of language.

  • Word tokenization
  • Character tokenization
  • Byte-based tokenization
  • BPE (Byte Pair Encoding) tokenization
  • Sentence tokenization

Sample Text¶

Let's define a sample text to work with:

In [1]:
text = "Communication & Intelligence is awesome!"

print("Text:", text)
Text: Communication & Intelligence is awesome!

Word Tokenization¶

Breaking text into words or word-like units

But what are words?

Rule-based: Simple .split()¶

The simplest approach: split on whitespace

In [2]:
# Simple whitespace split
simple_tokens = text.split()

print(f"Text: {text}\n")
print(f"Simple .split() tokens ({len(simple_tokens)}):")
print(simple_tokens)
Text: Communication & Intelligence is awesome!

Simple .split() tokens (5):
['Communication', '&', 'Intelligence', 'is', 'awesome!']

Problem: Punctuation stays attached to words!

  • "awesome!" should probably be ["awesome", "!"]

Rule-based: Regular Expressions¶

We can use a regular expression to extract runs of word characters:

In [3]:
import re

# Extract runs of word characters; punctuation and symbols are dropped
regex_tokens = re.findall(r'\w+', text)

print(f"Text: {text}\n")
print(f"Regex r'\\w+' tokens ({len(regex_tokens)}):")
print(regex_tokens)
Text: Communication & Intelligence is awesome!

Regex r'\w+' tokens (4):
['Communication', 'Intelligence', 'is', 'awesome']
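
Note that r'\w+' also drops the '&' and the '!' entirely. Below is a sketch of one possible richer pattern (just an illustrative rule, not a standard tokenizer) that keeps punctuation and symbols as separate tokens:

# Match runs of word characters OR single non-space, non-word characters
richer_tokens = re.findall(r"\w+|[^\w\s]", text)
print(richer_tokens)
# ['Communication', '&', 'Intelligence', 'is', 'awesome', '!']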

NLTK Word Tokenizer¶

NLTK provides a more sophisticated tokenizer that handles punctuation:

In [4]:
import nltk
from nltk.tokenize import word_tokenize
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)

nltk_tokens = word_tokenize(text)

print(f"Text: {text}\n")
print(f"NLTK word_tokenize() tokens ({len(nltk_tokens)}):")
print(nltk_tokens)
Text: Communication & Intelligence is awesome!

NLTK word_tokenize() tokens (6):
['Communication', '&', 'Intelligence', 'is', 'awesome', '!']

Notice:

  • Punctuation is now separated: "awesome!" becomes ['awesome', '!']
  • "&" is kept as its own token

Other remaining issues¶

  • Lowercase/uppercase? (Communication vs communication)
  • Contractions? (don't → do n't? don t? dont?)
  • Hyphenated words? (state-of-the-art → 1 token or several?)
  • Possessives? (John's → John 's? Johns?)
  • Numbers/currency? ($29.99, 1,000,000)
  • URLs/emails? (https://uchicago.edu)
  • Multi-word expressions? (New York, ice cream)
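
As a quick probe, here is how NLTK's word_tokenize handles a few of these cases. The sentence below is invented for illustration; run it and inspect the output, and note that other tokenizers will make different choices:

# A made-up sentence touching several of the issues above
tricky = "Don't miss John's state-of-the-art demo for $29.99 at https://uchicago.edu in New York!"

print(word_tokenize(tricky))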

Character Tokenization¶

Breaking text into individual characters

Use cases:

  • Character-level language models
  • Handling rare/unknown words
  • Languages without clear word boundaries (e.g., Chinese)
In [5]:
# Character tokenization - just use list()
char_tokens = list(text)

print(f"Text: {text}")
print(f"\nCharacter tokens ({len(char_tokens)}):")
print(char_tokens)
Text: Communication & Intelligence is awesome!

Character tokens (40):
['C', 'o', 'm', 'm', 'u', 'n', 'i', 'c', 'a', 't', 'i', 'o', 'n', ' ', '&', ' ', 'I', 'n', 't', 'e', 'l', 'l', 'i', 'g', 'e', 'n', 'c', 'e', ' ', 'i', 's', ' ', 'a', 'w', 'e', 's', 'o', 'm', 'e', '!']

Character vs Word Vocabulary Size¶

In [6]:
# Compare vocabulary sizes
all_words = word_tokenize(text.lower())
all_chars = list(text.lower())

unique_words = set(all_words)
unique_chars = set(all_chars)

print(f"Unique words: {len(unique_words)}")
print(f"Unique characters: {len(unique_chars)}")
print(f"\nUnique characters: {sorted(unique_chars)}")
Unique words: 6
Unique characters: 16

Unique characters: [' ', '!', '&', 'a', 'c', 'e', 'g', 'i', 'l', 'm', 'n', 'o', 's', 't', 'u', 'w']

Byte-Based Tokenization¶

Breaking text into bytes (UTF-8 encoding)

Advantages:

  • Universal: works for any language
  • Fixed vocabulary size (256 possible byte values)
  • Foundation for BPE tokenization used in GPT models
In [7]:
# Byte tokenization on our sample text
byte_tokens = list(text.encode('utf-8'))

print(f"Text: {text}")
print(f"\nByte tokens ({len(byte_tokens)}):")
print(byte_tokens)
Text: Communication & Intelligence is awesome!

Byte tokens (40):
[67, 111, 109, 109, 117, 110, 105, 99, 97, 116, 105, 111, 110, 32, 38, 32, 73, 110, 116, 101, 108, 108, 105, 103, 101, 110, 99, 101, 32, 105, 115, 32, 97, 119, 101, 115, 111, 109, 101, 33]
In [8]:
# Each ASCII letter is 1 byte
# Show bytes in hex for readability
print("In hexadecimal:")
print([hex(b) for b in byte_tokens])
In hexadecimal:
['0x43', '0x6f', '0x6d', '0x6d', '0x75', '0x6e', '0x69', '0x63', '0x61', '0x74', '0x69', '0x6f', '0x6e', '0x20', '0x26', '0x20', '0x49', '0x6e', '0x74', '0x65', '0x6c', '0x6c', '0x69', '0x67', '0x65', '0x6e', '0x63', '0x65', '0x20', '0x69', '0x73', '0x20', '0x61', '0x77', '0x65', '0x73', '0x6f', '0x6d', '0x65', '0x21']

Multilingual Text and Emojis¶

UTF-8 uses variable-length encoding:

  • ASCII (English letters): 1 byte each
  • Chinese characters: 3 bytes each
  • Emojis: 4 bytes each
In [9]:
# Multilingual example
sample = "Hello 世界 😀"

byte_tokens = list(sample.encode('utf-8'))
print(f"Text: {sample}")
print(f"Total bytes: {len(byte_tokens)}")
print(f"\nBreakdown:")
print(f"  'Hello ' = {len('Hello '.encode('utf-8'))} bytes (ASCII)")
print(f"  '世界'   = {len('世界'.encode('utf-8'))} bytes (Chinese, 3 bytes each)")
print(f"  ' 😀'   = {len(' 😀'.encode('utf-8'))} bytes (space + emoji, 4 bytes)")
Text: Hello 世界 😀
Total bytes: 17

Breakdown:
  'Hello ' = 6 bytes (ASCII)
  '世界'   = 6 bytes (Chinese, 3 bytes each)
  ' 😀'   = 5 bytes (space 1 byte + emoji 4 bytes)
In [10]:
# Compare tokenization methods on multilingual text
print(f"Text: {sample}\n")

print("Word tokenization:")
words = word_tokenize(sample)
print(f"  {len(words)} tokens: {words}")

print("\nCharacter tokenization:")
characters = list(sample)
print(f"  {len(characters)} tokens: {characters}")

print("\nByte tokenization:")
byte_tokens = list(sample.encode('utf-8'))
print(f"  {len(byte_tokens)} tokens: {byte_tokens}")
Text: Hello 世界 😀

Word tokenization:
  3 tokens: ['Hello', '世界', '😀']

Character tokenization:
  10 tokens: ['H', 'e', 'l', 'l', 'o', ' ', '世', '界', ' ', '😀']

Byte tokenization:
  17 tokens: [72, 101, 108, 108, 111, 32, 228, 184, 150, 231, 149, 140, 32, 240, 159, 152, 128]
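
Because UTF-8 encoding is lossless, the byte tokens can always be reassembled into the original string:

# Reassemble the bytes and decode back to text
decoded = bytes(byte_tokens).decode('utf-8')
print(decoded == sample)  # True: nothing is lost at the byte level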

BPE (Byte Pair Encoding) Tokenization¶

Subword tokenization - a middle ground between word and character level

Key idea: Start with characters, iteratively merge the most frequent pairs

Used by: GPT-2, GPT-3, GPT-4, LLaMA, and many modern LLMs
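
Before turning to a production tokenizer, here is a minimal sketch of the core merge loop on a tiny made-up word list. A real BPE tokenizer operates over bytes, weights pairs by word frequency, and stops at a target vocabulary size; this toy version only illustrates the "merge the most frequent pair" idea:

from collections import Counter

# Toy corpus: each word as a tuple of symbols
toy_corpus = [tuple("lower"), tuple("lowest"), tuple("newer"), tuple("wider")]

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most frequent one."""
    pair_counts = Counter()
    for word in words:
        for a, b in zip(word, word[1:]):
            pair_counts[(a, b)] += 1
    return pair_counts.most_common(1)[0][0] if pair_counts else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with a single merged symbol."""
    merged_words = []
    for word in words:
        out, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged_words.append(tuple(out))
    return merged_words

# A few merge steps: the most frequent adjacent pairs become new subword symbols
for step in range(4):
    pair = most_frequent_pair(toy_corpus)
    if pair is None:
        break
    print(f"Merge {step + 1}: {pair}")
    toy_corpus = merge_pair(toy_corpus, pair)

print(toy_corpus)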

In [11]:
import tiktoken

# Use GPT-4's tokenizer (cl100k_base encoding)
enc = tiktoken.get_encoding("cl100k_base")

bpe_tokens = enc.encode(text)

print(f"Text: {text}")
print(f"\nBPE token IDs ({len(bpe_tokens)}):")
print(bpe_tokens)

print("\nDecoded tokens:")
for token_id in bpe_tokens:
    print(f"  {token_id} -> '{enc.decode([token_id])}'")
Text: Communication & Intelligence is awesome!

BPE token IDs (6):
[66511, 612, 22107, 374, 12738, 0]

Decoded tokens:
  66511 -> 'Communication'
  612 -> ' &'
  22107 -> ' Intelligence'
  374 -> ' is'
  12738 -> ' awesome'
  0 -> '!'
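
Notice that the leading space is part of the token (' &', ' is'). As a quick check, decoding the IDs reproduces the original string exactly:

# BPE encode/decode round-trips without loss
print(enc.decode(bpe_tokens) == text)  # True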

Sentence Tokenization¶

Breaking text into sentences

You can probably guess what I am going to ask: what counts as a sentence?

In [12]:
text = "This course will introduce fundamental concepts in natural language processing (NLP). It will cover the basics of enabling computers to understand and generate language, including word embeddings, language modeling, transformers, and an overview of large language models. It will also cover topics on connections with other disciplines such as linguistics and other social sciences."

print("Text:", text)
Text: This course will introduce fundamental concepts in natural language processing (NLP). It will cover the basics of enabling computers to understand and generate language, including word embeddings, language modeling, transformers, and an overview of large language models. It will also cover topics on connections with other disciplines such as linguistics and other social sciences.

Rule-based: spaCy Sentencizer¶

spaCy's Sentencizer uses simple punctuation rules:

In [13]:
import spacy
from spacy.lang.en import English

# Create a blank English model with rule-based sentencizer
nlp = English()
nlp.add_pipe("sentencizer")

doc = nlp(text)
sentences = list(doc.sents)

print(f"Found {len(sentences)} sentences (spaCy rule-based):\n")
for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent.text.strip()}")
Found 3 sentences (spaCy rule-based):

1. This course will introduce fundamental concepts in natural language processing (NLP).
2. It will cover the basics of enabling computers to understand and generate language, including word embeddings, language modeling, transformers, and an overview of large language models.
3. It will also cover topics on connections with other disciplines such as linguistics and other social sciences.

NLTK Sentence Tokenizer¶

NLTK uses a trained model (Punkt) that handles abbreviations better:

In [14]:
from nltk.tokenize import sent_tokenize

nltk_sentences = sent_tokenize(text)

print(f"Found {len(nltk_sentences)} sentences (NLTK Punkt):\n")
for i, sent in enumerate(nltk_sentences, 1):
    print(f"{i}. {sent.strip()}")
Found 3 sentences (NLTK Punkt):

1. This course will introduce fundamental concepts in natural language processing (NLP).
2. It will cover the basics of enabling computers to understand and generate language, including word embeddings, language modeling, transformers, and an overview of large language models.
3. It will also cover topics on connections with other disciplines such as linguistics and other social sciences.

What about this text?¶

In [15]:
text = "I work at U.of.C. What about you?"

print("Text:", text)
Text: I work at U.of.C. What about you?
In [16]:
doc = nlp(text)
sentences = list(doc.sents)

print(f"Found {len(sentences)} sentences (spaCy rule-based):\n")
for i, sent in enumerate(sentences, 1):
    print(f"{i}. {sent.text.strip()}")
Found 2 sentences (spaCy rule-based):

1. I work at U.of.
2. C. What about you?
In [17]:
nltk_sentences = sent_tokenize(text)

print(f"Found {len(nltk_sentences)} sentences (NLTK Punkt):\n")
for i, sent in enumerate(nltk_sentences, 1):
    print(f"{i}. {sent.strip()}")
Found 2 sentences (NLTK Punkt):

1. I work at U.of.C.
2. What about you?

Takeaways¶

  • Humans read text in ways that may differ from how machines process it.

  • What is important to humans may not be important for machines.