AI Token Counter

Count tokens and estimate API cost for any LLM model

No upload: runs entirely in your browser. Your text never leaves your device.

Estimated API Cost: GPT-4o

Requests     Input cost   Output cost*   Total
1            $0.00        $0.00          $0.00
100          $0.00        $0.00          $0.00
1,000        $0.00        $0.00          $0.00
1,000,000    $0.00        $0.00          $0.00

* Output cost assumes the model generates the same number of tokens as the input. Actual output length varies. Prices are approximate; check each provider's pricing page for current rates. Input: $2.50/1M tokens · Output: $10/1M tokens.

What Are AI Tokens and Why Do They Matter?

When you send text to an AI API, whether it's GPT-4o, Claude, or Gemini, the model never sees raw characters. Instead, it sees a sequence of tokens: compact numeric IDs that represent fragments of text. Tokenization is the first step in every large language model (LLM) inference pipeline, and understanding it is the single most practical skill for controlling API costs and avoiding context window errors in production.

For English text, a rough rule of thumb is 1 token ≈ 4 characters, or about 0.75 words per token (meaning roughly 1.33 tokens per word). A 500-word blog post is around 670 tokens. A 10,000-word document is around 13,000 tokens. But these are averages: token boundaries shift depending on the vocabulary each model was trained with, the language of the text, and how much punctuation or code appears.
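The rule of thumb can be turned into a quick estimator. This is a minimal sketch: the function names are hypothetical, and the ratios are the averages quoted above, not the output of any real tokenizer.

```python
import math

CHARS_PER_TOKEN = 4.0    # rule of thumb for English prose
TOKENS_PER_WORD = 4 / 3  # about 1.33 tokens per word


def estimate_tokens_from_chars(text: str) -> int:
    """Estimate tokens as characters / 4, rounded up."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def estimate_tokens_from_words(word_count: int) -> int:
    """Estimate tokens as words * 1.33, rounded to the nearest token."""
    return round(word_count * TOKENS_PER_WORD)


print(estimate_tokens_from_words(500))              # 667
print(estimate_tokens_from_words(10_000))           # 13333
print(estimate_tokens_from_chars("Hello, world!"))  # 4
```

Treat the result as a planning figure only; real counts depend on the model's vocabulary.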

How Tokenization Works: BPE Under the Hood

Most frontier LLMs use Byte Pair Encoding (BPE) to split text into tokens. BPE starts with individual characters (or bytes) and iteratively merges the most frequent adjacent pairs until it builds a vocabulary of a fixed size (50,000–200,000+ entries for modern models). The result is a vocabulary where common words and word fragments each have their own token ID, while rare words get split into subword pieces.
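The merge loop can be sketched on a toy corpus. This is a character-level illustration under simplifying assumptions (whole words as units, ties broken by first occurrence, hypothetical function names); production tokenizers work on bytes and far larger corpora.

```python
from collections import Counter
from typing import Optional

Word = tuple[str, ...]


def most_frequent_pair(words: dict[Word, int]) -> Optional[tuple[str, str]]:
    """Return the adjacent symbol pair with the highest total count."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    # Ties go to the pair that was counted first.
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words: dict[Word, int], pair: tuple[str, str]) -> dict[Word, int]:
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged: dict[Word, int] = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus: list[str], num_merges: int) -> list[tuple[str, str]]:
    """Learn up to `num_merges` merge rules from a list of words."""
    words: dict[Word, int] = Counter(tuple(word) for word in corpus)
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges


corpus = ["low"] * 5 + ["lower"] * 2 + ["newest"] * 6 + ["widest"] * 3
print(train_bpe(corpus, 4))
# [('e', 's'), ('es', 't'), ('l', 'o'), ('lo', 'w')]
```

Frequent fragments such as "est" and "low" become single symbols after a few merges, which is how common words end up as one token.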

Illustrative examples based on GPT-4's cl100k_base vocabulary (exact splits are best confirmed with the official tokenizer):

  • “the” → 1 token (extremely common)
  • “understanding” → 3 tokens: “under” + “stand” + “ing”
  • “GPT-4” → 3 tokens: “GP” + “T” + “-4”
  • “Hello, world!” → 4 tokens: “Hello” + “,” + “ world” + “!”

Note that spaces are typically merged with the following word into a single token; this is why “ world” (with a leading space) is one token rather than two. This detail is invisible to users, but it explains why you can't simply count characters, divide by 4, and expect exact results for every input.
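The leading-space behavior can be illustrated with a simplified pre-tokenization step. The pattern below is an assumption made for illustration and is much simpler than the pattern any production tokenizer uses; it only shows how a space attaches to the word that follows it.

```python
import re

# Simplified pattern (illustrative only): an optional leading space plus a
# run of word characters, or a single punctuation character. Whitespace that
# is not directly followed by a word is dropped by this toy pattern.
PRETOKEN_PATTERN = re.compile(r" ?\w+|[^\w\s]")


def pretokenize(text: str) -> list[str]:
    """Split text into chunks, keeping a leading space with its word."""
    return PRETOKEN_PATTERN.findall(text)


print(pretokenize("Hello, world!"))  # ['Hello', ',', ' world', '!']
```

In a real BPE tokenizer, each chunk would then be split further or matched against the learned vocabulary.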

Real Use Case: Estimating Cost Before You Build

Suppose you're building a document summarization service. Your users upload PDFs of roughly 20 pages (≈8,000 words ≈ 10,700 tokens). Each request uses a 500-token system prompt and generates a 400-token summary. Total per request: ~11,200 input tokens + 400 output tokens.

Using GPT-4o at $2.50/M input and $10.00/M output:

  • Input: (11,200 / 1,000,000) × $2.50 = $0.028 per request
  • Output: (400 / 1,000,000) × $10.00 = $0.004 per request
  • Total: $0.032 per request, or about $32 per 1,000 documents

Switching to GPT-4o Mini at $0.15/M input and $0.60/M output:

  • Input: $0.00168 · Output: $0.00024 · Total: about $0.002 per request
  • That's about $2 per 1,000 documents, a 94% cost reduction
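The same comparison can be scripted so it is easy to rerun when prices or prompt sizes change. This is a minimal sketch; `request_cost` is a hypothetical helper, the prices are the ones quoted in this scenario, and the input count is taken as the document tokens plus the system prompt tokens.

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_price_per_m: float, output_price_per_m: float) -> float:
    """Cost in dollars of one request, given prices per million tokens."""
    return ((input_tokens / 1_000_000) * input_price_per_m
            + (output_tokens / 1_000_000) * output_price_per_m)


input_tokens = 10_700 + 500  # document tokens + system prompt tokens
output_tokens = 400          # generated summary

gpt4o = request_cost(input_tokens, output_tokens, 2.50, 10.00)
mini = request_cost(input_tokens, output_tokens, 0.15, 0.60)

print(f"GPT-4o:      ${gpt4o:.4f} per request, ${gpt4o * 1000:.2f} per 1,000")
print(f"GPT-4o Mini: ${mini:.4f} per request, ${mini * 1000:.2f} per 1,000")
print(f"Reduction:   {1 - mini / gpt4o:.0%}")  # 94%
```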

This kind of calculation, running token counts through the cost formula before you write a single line of integration code, can save thousands of dollars and prevent architecture mistakes that are painful to unwind after launch.

Context Windows: The Hard Limit

Every model has a context window: the maximum total tokens (input + output) it can process in a single API call. Requests that exceed this limit are rejected with an error; the model does not silently truncate your input. Common context windows in 2025:

  • GPT-4o / GPT-4o Mini: 128,000 tokens (~96,000 words)
  • o1 / o3: 200,000 tokens (~150,000 words)
  • Claude Opus 4 / Sonnet / Haiku: 200,000 tokens
  • Gemini 1.5 Pro / 1.5 Flash / 2.0 Flash: 1,000,000 tokens (~750,000 words)
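A pre-flight check against these limits might look like the sketch below. The dictionary keys are illustrative labels, not official API model identifiers, and the sizes are the ones listed above.

```python
# Context window sizes in tokens, taken from the list above.
CONTEXT_WINDOWS = {
    "gpt-4o": 128_000,
    "o1": 200_000,
    "claude": 200_000,
    "gemini-1.5-pro": 1_000_000,
}


def remaining_tokens(model: str, input_tokens: int, max_output_tokens: int) -> int:
    """Tokens left after input and reserved output (negative if over the limit)."""
    return CONTEXT_WINDOWS[model] - (input_tokens + max_output_tokens)


def fits_context(model: str, input_tokens: int, max_output_tokens: int) -> bool:
    """True if input plus reserved output fits in the model's window."""
    return remaining_tokens(model, input_tokens, max_output_tokens) >= 0


print(fits_context("gpt-4o", 120_000, 4_000))  # True
print(fits_context("gpt-4o", 126_000, 4_000))  # False
```

Reserving the output budget up front matters because the window covers input and output together.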

In chat applications, context window pressure builds over time as you include the full conversation history in every request. A 1-hour support chat transcript can easily reach 20,000+ tokens. Strategies to manage this include summarizing older turns into a compact “memory block” and only including recent messages verbatim.
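One way to implement the "recent messages verbatim" half of that strategy is to split the history by a token budget. This is a sketch with hypothetical names; it uses the 4-characters-per-token estimate and leaves the summarization of the older part to the caller.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token for English text."""
    return max(1, math.ceil(len(text) / 4))


def split_history(messages: list[str], recent_budget: int) -> tuple[list[str], list[str]]:
    """Split messages into (older, recent), where recent fits the token budget."""
    used = 0
    split_at = len(messages)
    # Walk backwards from the newest message until the budget is exhausted.
    for index in range(len(messages) - 1, -1, -1):
        cost = estimate_tokens(messages[index])
        if used + cost > recent_budget:
            break
        used += cost
        split_at = index
    return messages[:split_at], messages[split_at:]


history = ["a" * 40, "b" * 40, "c" * 40]  # about 10 tokens each
older, recent = split_history(history, recent_budget=25)
print(len(older), len(recent))  # 1 2
```

The `older` list is what would be condensed into the memory block; `recent` is sent as-is.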

Input vs Output: Where the Cost Actually Lives

One of the most counterintuitive facts about LLM pricing is that output tokens cost 3–5× more than input tokens per million. This is because generating each output token requires one full forward pass through the entire model (autoregressive decoding), while input tokens are processed more efficiently in parallel using attention mechanisms. The practical implication: long, verbose AI responses cost far more than concise ones. Prompting the model to be concise (“Answer in under 100 words”) is a legitimate cost-reduction technique, not just an aesthetic preference.

For most chat applications, input tokens dominate because you include the conversation history, system prompt, and any retrieved documents in every request. As history grows, per-turn input cost grows linearly. Batching, caching, and summarization all attack this growth directly.
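The linear growth is easy to see in a small simulation. The sketch below assumes fixed message sizes (the numbers are made up for illustration) and a hypothetical helper that resends the full history on every turn.

```python
def per_turn_input_tokens(turns: int, system_tokens: int,
                          user_tokens: int, reply_tokens: int) -> list[int]:
    """Input tokens billed on each turn when the full history is resent."""
    history = 0
    per_turn = []
    for _ in range(turns):
        per_turn.append(system_tokens + history + user_tokens)
        # After the turn, both the user message and the reply join the history.
        history += user_tokens + reply_tokens
    return per_turn


usage = per_turn_input_tokens(turns=10, system_tokens=500,
                              user_tokens=50, reply_tokens=150)
print(usage[0], usage[-1])  # 550 2350
print(sum(usage))           # 14500
```

Per-turn input grows by a constant amount each turn, so the total billed over a conversation grows roughly with the square of its length.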

Prompt Caching: A Hidden Cost Lever

Both Anthropic and OpenAI now offer prompt caching, which bills a repeated prompt prefix at a reduced rate. With Anthropic you mark the cacheable prefix explicitly; OpenAI applies caching automatically to sufficiently long prompts. If the next API call uses the same prefix, the cached tokens cost dramatically less: Anthropic charges ~10% of the normal input price for cache hits; OpenAI charges 50% for cached tokens in most tiers. For applications where a large system prompt or reference document is reused across many requests, prompt caching can cut total API spend by 50–80% without any change to quality.

To use caching effectively, structure your prompts so that the stable, reusable content (system instructions, documents, few-shot examples) comes first, and the dynamic, per-request content (user message, query) comes last. The cache hit applies only to the identical prefix, so even a single character difference in the prefix invalidates it.
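The effect on input spend can be estimated with a short calculation. This sketch is deliberately simplified: it assumes every request after the first is a cache hit and ignores cache-write surcharges, cache expiry, and minimum prompt lengths. The function and parameter names are hypothetical.

```python
def input_cost(prefix_tokens: int, dynamic_tokens: int, requests: int,
               price_per_m: float, cached_price_ratio: float = 1.0) -> float:
    """Total input cost in dollars for requests that share one prefix.

    cached_price_ratio is the fraction of the normal input price paid for
    cached prefix tokens (1.0 means no caching). The first request pays
    full price for the prefix.
    """
    if requests <= 0:
        return 0.0
    first = prefix_tokens + dynamic_tokens
    later = (requests - 1) * (prefix_tokens * cached_price_ratio + dynamic_tokens)
    return (first + later) / 1_000_000 * price_per_m


# 10,000-token stable prefix, 200-token user message, 1,000 requests.
baseline = input_cost(10_000, 200, 1_000, 2.50)
cached = input_cost(10_000, 200, 1_000, 2.50, cached_price_ratio=0.5)

print(f"without caching: ${baseline:.2f}")            # $25.50
print(f"with caching:    ${cached:.2f}")
print(f"savings:         {1 - cached / baseline:.0%}")  # 49%
```

The larger the stable prefix is relative to the per-request content, the closer the savings get to the full cache discount.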

Tokens Across Languages and Code

The 4-characters-per-token rule applies primarily to English text. Other languages and content types tokenize differently:

  • Non-Latin scripts (Chinese, Japanese, Arabic, Hindi): Often 1–3 characters per token, meaning the same semantic content costs 2–4× more in tokens than English.
  • Source code: Typically 2–3 characters per token due to dense punctuation, brackets, and identifiers. A 100-line Python file is roughly 1,000–1,500 tokens.
  • Numbers: Digits are split individually or in short groups, depending on the tokenizer. “1000000” (7 digits) can take anywhere from 3 to 7 tokens.
  • URLs: Slashes, dots, and hyphens are often separate tokens. A long URL can tokenize to 30+ tokens.

When working with non-English content or code-heavy prompts, add a 20–40% buffer to your token estimates to avoid unexpected context window overruns.
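A buffered estimate is a one-line change to the basic rule of thumb. In this sketch the function name and the 30% default are assumptions chosen from the middle of the range mentioned above.

```python
import math


def buffered_estimate(text: str, buffer_percent: int = 30) -> int:
    """4-characters-per-token estimate, inflated by a safety buffer."""
    return math.ceil(len(text) * (100 + buffer_percent) / 400)


snippet = "x" * 400  # stands in for 400 characters of code or non-English text
print(buffered_estimate(snippet, buffer_percent=0))   # 100
print(buffered_estimate(snippet))                     # 130
print(buffered_estimate(snippet, buffer_percent=40))  # 140
```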

Frequently Asked Questions

What is an AI token and why does it matter?

A token is the basic unit that large language models (LLMs) use to process text. Depending on the model and language, a token is roughly 3–5 characters of English text: common words like "the" or "is" are usually a single token, while longer or rarer words get split into multiple tokens (subwords). Tokens matter because every major LLM API (OpenAI, Anthropic, Google) charges per token for both the text you send (input tokens) and the text the model generates (output tokens). Understanding token counts lets you predict API costs, avoid exceeding context window limits, and optimize prompts to be more cost-efficient.

How accurate is this token counter?

This tool uses a regex-based approximation of Byte Pair Encoding (BPE) tokenization, which achieves roughly 90–95% accuracy compared to each model's native tokenizer for standard English text. Accuracy is highest for plain English prose and lower for dense code, non-Latin scripts, or highly technical content with many punctuation symbols. For exact counts on critical production workloads, use the official tokenizers: the tiktoken library for OpenAI models (available via npm or PyPI) and the token-counting endpoint exposed by the Anthropic Python/TypeScript SDKs (messages.count_tokens in the Python SDK) for Claude models. This tool is ideal for quick estimates during prompt design and cost planning.

How do I estimate API costs for different models?

API cost equals (input tokens ÷ 1,000,000) × input price + (output tokens ÷ 1,000,000) × output price. Input price is what you pay for the text you send (your prompt and any context); output price is what you pay for the text the model generates (its response). Output tokens are typically 2–5× more expensive than input tokens. To estimate total cost: count your input tokens with this tool, estimate how many tokens the model will generate in its response, then apply both rates. The cost table in this tool shows the input-side cost automatically and uses a 1:1 output ratio as a baseline for comparison.

What is the difference between input tokens and output tokens?

Input tokens (also called prompt tokens) are all the tokens in the text you send to the model: your system prompt, user message, conversation history, and any documents or context you include. Output tokens (also called completion tokens) are the tokens in the text the model generates in response. Both are billed separately at different rates: output tokens cost more because generating each token requires a full forward pass through the model, while input tokens can be processed more efficiently in parallel. For most chat and Q&A use cases, input tokens exceed output tokens because you include full conversation history in each request.

What happens when I exceed a model's context window?

Every LLM has a context window limit: the maximum total number of input plus output tokens it can process in a single API call. For example, GPT-4o supports up to 128,000 tokens, Claude models support up to 200,000 tokens, and Gemini 1.5 Pro supports up to 1,000,000 tokens. If your prompt plus expected output exceeds this limit, the API returns an error and the request fails. Common strategies to stay within limits include: chunking long documents into smaller pieces, using retrieval-augmented generation (RAG) to fetch only relevant sections, summarizing older conversation history, and using models with larger context windows for long-document tasks.
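Chunking, the first strategy in that list, can be sketched as a greedy split on word boundaries. The helper names are hypothetical and the token counts come from the 4-characters-per-token estimate, so leave headroom when using the chunks against a real limit.

```python
import math


def estimate_tokens(text: str) -> int:
    """Rough estimate: about 4 characters per token for English text."""
    return max(1, math.ceil(len(text) / 4))


def chunk_by_token_budget(text: str, max_tokens: int) -> list[str]:
    """Greedily pack whole words into chunks of at most max_tokens (estimated)."""
    chunks, current, current_tokens = [], [], 0
    for word in text.split():
        cost = estimate_tokens(word + " ")  # count the separating space too
        if current and current_tokens + cost > max_tokens:
            chunks.append(" ".join(current))
            current, current_tokens = [], 0
        current.append(word)
        current_tokens += cost
    if current:
        chunks.append(" ".join(current))
    return chunks


chunks = chunk_by_token_budget("word " * 100, max_tokens=20)
print(len(chunks))  # 10
```

A single word that exceeds the budget still becomes its own chunk; splitting inside words is left out of this sketch.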

Why do the same words use different numbers of tokens across models?

Each AI provider trains its own tokenizer (vocabulary) independently. OpenAI's GPT models use the cl100k_base or o200k_base tokenizer. Anthropic's Claude uses a different vocabulary. Google's Gemini uses yet another. Because each vocabulary is trained on different data with different merge rules, the same sentence may tokenize into slightly different numbers of tokens, typically within 5–15% of each other for English text. The difference is more pronounced for code, non-English languages, and special characters. This is why token counts for GPT-4o and Claude Sonnet may differ slightly for the same input.

How can I reduce my API token usage and costs?

The most effective strategies to lower token usage are: (1) Write concise system prompts: trim any unnecessary instructions or repeated context. (2) Use prompt caching: both Anthropic and OpenAI offer caching for repeated prompt prefixes at a significant discount. (3) Choose the right model: use smaller, cheaper models (GPT-4o Mini, Claude Haiku) for simple tasks and only escalate to larger models when needed. (4) Truncate conversation history: instead of sending the full chat history every turn, summarize older messages. (5) Use structured outputs: request JSON with a strict schema to avoid verbose prose responses. (6) Batch requests: group multiple independent queries into a single API call where the model's architecture allows it.