LaTeX Word Counting Methods: Comparing the Approaches
Ask three tools to count the words in the same LaTeX file and you’ll often get three different numbers. That isn’t a bug. Each method draws the line between “text” and “not text” in a different place, and which one is right depends on what you’re counting for. This piece compares the main approaches to LaTeX word counting, what each one actually measures, and where each one breaks. If you just want the commands to run, the practical LaTeX word count guide covers the texcount workflow directly.
What makes a LaTeX word count ambiguous
The trouble is that a .tex file is source code, not finished text. The same file contains:
- Prose, the part you mean when you say “word count”.
- Markup such as \section{}, \textbf{}, and \cite{key}.
- Math, where $x^2 + y^2 = z^2$ renders as symbols, not five English words.
- Comments after %, which never reach the reader.
- Captions, footnotes, and headers, which a limit may or may not include.
Any counting method is really a set of decisions about each of those categories. A naive wc -w counts every whitespace-separated token, including commands, comments, and the fragments of every formula, so it overcounts badly. A stripping approach removes markup but can mangle prose along the way. A parser-based tool like texcount tries to understand structure, which is more accurate but never perfect. Knowing the failure mode of each one is the difference between trusting a number and guessing.
Method 1: wc on the raw source
The Unix wc tool is the baseline, and it’s wrong on purpose here:

    wc -w document.tex

It treats \section{Introduction} as a word and counts every command, comment, and fragment of math along with the prose. The result is inflated and meaningless for a submission. It’s still useful as a sanity check: if texcount says 6,000 and wc says 6,400, the gap is roughly your markup, which is plausible. A gap of 3x means something is misconfigured.
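To see what that raw count actually includes, here is a minimal Python sketch that applies the same whitespace-splitting rule to a three-line snippet. The sample source is invented for illustration. It yields 19 tokens, of which only seven are words a reader would see.

```python
# What `wc -w` effectively does: count whitespace-separated tokens.
# The sample source below is invented for illustration.
sample = r"""\section{Introduction}
We prove that $x^2 + y^2 = z^2$ holds. % TODO tighten this
\begin{itemize} \item first point \end{itemize}
"""

tokens = sample.split()
print(len(tokens))  # 19
print(tokens)       # commands, math fragments and the comment are all in here
```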
Method 2: strip the markup, then count
The idea: remove LaTeX commands first, then run a plain word counter. The classic tool is detex, which ships with most TeX distributions:

    # Strip LaTeX, then count
    detex document.tex | wc -w
    # -l forces LaTeX mode; \input and \include are followed unless you pass -n
    detex -l main.tex | wc -w

You can approximate the same thing with sed, though it’s blunt:

    # Crude: delete anything that looks like a command, then count
    sed -E 's/\\[a-zA-Z]+(\{[^}]*\})?//g' document.tex | wc -w

Stripping is fast and dependency-light. Its weakness is that it doesn’t understand context. detex leaves the readable text of some commands and drops the argument of others, so captions and footnotes get treated inconsistently, and nested braces confuse the sed version entirely. Math handling is hit or miss. For a rough running total this is fine. For a hard limit, it’s too approximate.
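The nested-brace problem is easy to reproduce. The sketch below applies the same idea as the sed one-liner, written as a Python regex, to an invented sentence with one command nested inside another. The name strip_commands is made up for this example.

```python
import re

# Same idea as the sed one-liner: a command name plus an optional {...}
# argument that stops at the FIRST closing brace it meets.
COMMAND = re.compile(r"\\[a-zA-Z]+(\{[^}]*\})?")

def strip_commands(text):
    """Delete anything that looks like a command (crude on purpose)."""
    return COMMAND.sub("", text)

# Invented sample with \emph nested inside \textbf.
nested = r"See \textbf{bold \emph{nested} text} here."
stripped = strip_commands(nested)
print(repr(stripped))         # 'See  text} here.'
print(len(stripped.split()))  # 3, although a reader sees 5 words
```

The match ends at the brace that closes \emph, so "bold" and "nested" vanish and a stray brace is left glued to "text". That is the kind of silent undercount a pattern-based stripper can produce.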
Method 3: texcount, the parser
texcount is a Perl script (not a LaTeX package, despite how often it’s described as one) that parses the document structure rather than pattern-matching it. That’s why it can report words in text, words in headers, and words in captions as separate figures, and count floats and math expressions on their own.

    texcount document.tex

Because it parses, you can teach it about your own commands using %TC: comment directives placed in the source:

    % Count the argument of \keyword as text
    %TC:macro \keyword [text]
    % Ignore a \todo note completely
    %TC:macro \todo [ignore]
    % Don't count the body of a custom environment
    %TC:envir solution [ignore] ignore

This is the most accurate general method, and it’s what most journals and thesis offices implicitly expect, because its “words in text” figure matches the human notion of body text. Its limits are real though. It can’t know what an undefined macro expands to, so heavy macro use needs directives, and very unusual document structures can still trip it. The fix is always the same: run texcount -v to see what it counted, then add directives where it guessed wrong.
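If you want that number inside a script, a thin wrapper is usually enough. The sketch below assumes texcount is on your PATH and relies on the -1, -sum= and -inc options as the texcount manual describes them. The names texcount_total, WORDS_ONLY and the run parameter are made up for this example, and it’s worth confirming the flags against texcount -h on your install.

```python
import subprocess

# Weights for -sum=, in the order the texcount manual lists them:
# text words, header words, caption words, headers, floats,
# inline formulae, displayed formulae. This choice sums words only.
WORDS_ONLY = "-sum=1,1,1,0,0,0,0"

def texcount_total(path, include_inputs=True, run=subprocess.run):
    """Return a single summed word total for `path` (a sketch, not a spec).

    `run` is injectable so the function can be exercised without
    texcount installed. int() raises if texcount prints anything
    other than a bare number, which is a deliberate loud failure.
    """
    cmd = ["texcount", "-1", WORDS_ONLY]
    if include_inputs:
        cmd.append("-inc")  # also parse files pulled in by \input / \include
    cmd.append(path)
    result = run(cmd, capture_output=True, text=True, check=True)
    return int(result.stdout.strip())
```

Calling texcount_total("main.tex") returns one integer you can log or compare. For the per-category breakdown, run texcount directly as shown above.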
Method 4: a script you control
When you need counting rules that no tool offers out of the box, a small script gives you full control. A regex approach in Python is enough for most custom rules:

    import re
    import sys

    def count_latex_words(path):
        with open(path, encoding="utf-8") as f:
            content = f.read()
        # Drop comments (a % not preceded by a backslash)
        content = re.sub(r"(?<!\\)%.*", "", content)
        # Drop display math first, then inline math
        content = re.sub(r"\$\$.*?\$\$", "", content, flags=re.DOTALL)
        content = re.sub(r"\$[^$]*\$", "", content)
        # Drop commands. Keeping the text in their final argument is harder,
        # so this version simply removes the command, an optional [...],
        # and its first {...}
        content = re.sub(r"\\[a-zA-Z]+\*?(\[[^\]]*\])?(\{[^}]*\})?", "", content)
        # Count runs of word characters
        return len(re.findall(r"\w+", content))

    if __name__ == "__main__":
        print(count_latex_words(sys.argv[1]))

Be honest about what this is: a heuristic. Regex can’t truly parse LaTeX, so nested braces, verbatim blocks, and unusual environments will slip past it. The value of rolling your own isn’t accuracy, it’s that you can encode a rule your institution invented, like “count everything in \chapter bodies but nothing in \marginpar”. For anything standard, texcount will beat your script. For one weird local requirement, the script wins.
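For a rule like “nothing in \marginpar”, the one place a regex reliably fails is nested braces, so a small brace counter is a safer building block. This is a minimal sketch under stated assumptions: drop_command is a hypothetical helper, the argument must follow the command name directly, braces must be balanced, and escaped braces are not handled.

```python
def drop_command(text, name):
    r"""Remove every \name{...} from text, honouring nested braces.

    Hypothetical helper for a local rule such as "nothing in \marginpar
    counts". Escaped braces (\{ and \}) are NOT handled.
    """
    marker = "\\" + name + "{"
    out = []
    i = 0
    while True:
        start = text.find(marker, i)
        if start == -1:
            out.append(text[i:])  # no more occurrences: keep the rest
            return "".join(out)
        out.append(text[i:start])  # keep everything before the command
        depth = 1
        j = start + len(marker)
        # Walk forward until the brace opened by the marker is closed.
        while j < len(text) and depth > 0:
            if text[j] == "{":
                depth += 1
            elif text[j] == "}":
                depth -= 1
            j += 1
        i = j  # resume just past the matching close brace


sample = r"Main text \marginpar{aside with \emph{nested} braces} continues."
print(repr(drop_command(sample, "marginpar")))  # 'Main text  continues.'
```

Run a helper like this on the source before the word-counting step, so that the excluded material is gone by the time the generic command stripping happens.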
Choosing a method
| Method | Accuracy | Setup | Best for |
|---|---|---|---|
| wc -w | Poor (counts markup) | None | A quick sanity check |
| detex + wc | Rough | Usually preinstalled | Running totals while drafting |
| texcount | High | Install once, add directives | Submissions, theses, the default choice |
| Custom script | As good as you make it | Real work | One-off institutional rules |
For the overwhelming majority of cases the answer is texcount. The other methods earn their place either as a cross-check (wc, detex) or as an escape hatch when no tool matches a strange requirement (a script).
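The cross-check itself is a single division. The sketch below turns the rule of thumb from the wc section into a function. The name cross_check and the default threshold of 3.0 are assumptions taken from that rule of thumb, not a standard.

```python
def cross_check(raw_count, parsed_count, max_ratio=3.0):
    """Compare a raw wc-style count with a parser count.

    Returns (ratio, suspicious). max_ratio=3.0 encodes the rule of
    thumb that a 3x gap means something is misconfigured; the exact
    threshold is an assumption, so tune it to your documents.
    """
    if parsed_count <= 0:
        raise ValueError("parsed_count must be positive")
    ratio = raw_count / parsed_count
    return ratio, ratio >= max_ratio


print(cross_check(6400, 6000))   # ratio just under 1.07, not suspicious
print(cross_check(18000, 6000))  # (3.0, True)
```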
Counting in a collaborative editor
All four methods above assume you run a command against a file on disk. That changes when the document lives in a browser and several people are editing it at once. A live count has to update as the text changes and reflect a multi-file project as it compiles, not as it was saved an hour ago.
inscrive.io handles word counting as part of real-time collaboration: the count tracks the document as authors type and recomputes across \input files on each compile, so what you see matches what’s in the PDF. It’s freemium: the Free plan is €0 with up to 10 active projects and 60-second compiles, and Pro is €7/month with 480-second compiles and AI-suggested fixes for compile errors. Everything is stored in the EU (Hetzner, Germany and Finland, ISO 27001 data centres) under a signed DPA, and your documents are never used to train AI models. If you’re weighing browser-based options, the online LaTeX editor comparison lays out how they differ on exactly these points.
The practical takeaway
There is no single correct LaTeX word count, only a count that matches the rule you were given. Pick texcount as your default, keep wc or detex around as a cross-check, and reach for a script only when a requirement is genuinely unusual. Whatever you choose, record the method next to the number. A count without its method is just a guess that happens to have digits.
Writing collaboratively and tired of running counts by hand? Start writing on inscrive.io for free and let the word count keep itself current across every file and every co-author.
Further reading
- LaTeX word count guide, the practical texcount workflow
- texcount manual, the official documentation
- LaTeX beginner’s guide, if you’re new to the markup




