# counting words in 2010

## counting words in 2010

As I understand it, TeXShop uses /usr/texbin/detex to calculate document statistics (words, characters, lines). Specifically, it calls detex via a wrapper script included in the application's resources. (In my case, the wrapper has been "tweaked" but this is not relevant here.)

The problem I'm seeing is with /usr/texbin/detex as supplied with TeX Live 2010 as opposed to the versions supplied with TeX Live 2008 and 2009. Essentially, I'm getting much lower word counts than I should because detex is stripping out text which it really shouldn't. The things I'm certain about include footnote text and italicised text but I suspect these are just a part of the problem.

I'm hoping this isn't intended to be a feature. Does anybody know:

- if this is a known (or unknown) bug?
- if there is any way of working around it? (I'm currently using the 2009 issue of detex but that's a bit messy.)
- if there is a better way of getting document statistics?

Specifically, I need word counts which are as accurate as possible. But if there is to be inaccuracy, it is generally better if the count is reported as slightly higher than it really is rather than lower because I'm typically trying to write stuff which does not exceed a given limit. This makes the current detex almost useless.

I know detex is used for more than word counts but can't imagine what purpose is served by stripping out italic text, for example. Please, this isn't supposed to be a feature, is it? Please?!

This is also intended to alert people who rely on TeXShop's statistics (or detex | wc) that the results may be unreliable with TeX Live 2010. Perhaps I missed it, but I don't recall seeing any warnings to this effect or information about changes to the current version of detex. (If anybody saw such and can send me a pointer, that'd be great.)

Thanks,
cfr
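Since the aim is to stay under a word limit, the count from a pipeline such as detex | wc -w can be fed into a small check. The following is only a sketch: the function name check_limit and its messages are made up for illustration, and it is only as reliable as the count it is given.

```shell
# check_limit LIMIT   (hypothetical helper, not part of detex or TeXShop)
# Read a single word count on stdin (for example from `wc -w`) and report
# whether it is within LIMIT. Exit status is 0 if within, 1 if over.
check_limit() {
    limit=$1
    # `read` strips the leading blanks that some versions of wc print.
    read -r count
    if [ "$count" -le "$limit" ]; then
        echo "OK: $count of $limit words"
    else
        echo "OVER: $count words, limit is $limit"
        return 1
    fi
}
```

A possible use would be `detex paper.tex | wc -w | check_limit 3500`, where paper.tex and the limit are placeholders. Because a count that is too low hides an overrun, it is safer to pair this with a counter that errs on the high side.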
## Re: counting words in 2010

 I'm not sure if this is a solution, but Excalibur counts words as it spell-checks. There is also WordService from Devon Technologies that works with many programs; it has a word count feature. Free. http://www.devon-technologies.com/products/freeware/services.htm

David Derbes
U of Chicago Laboratory Schools

On Oct 28, 2010, at 6:56 PM, Dr. Clea F. Rees wrote:

> As I understand it, TeXShop uses /usr/texbin/detex to calculate
> document statistics (words, characters, lines). Specifically, it calls
> detex via a wrapper script included in the application's resources. (In
> my case, the wrapper has been "tweaked" but this is not relevant here.)
>
> The problem I'm seeing is with /usr/texbin/detex as supplied with TeX
> Live 2010 as opposed to the versions supplied with TeX Live 2008 and
> 2009. Essentially, I'm getting much lower word counts than I should
> because detex is stripping out text which it really shouldn't. The
> things I'm certain about include footnote text and italicised text but
> I suspect these are just a part of the problem.
>
> I'm hoping this isn't intended to be a feature. Does anybody know:
> - if this is a known (or unknown) bug?
> - if there is any way of working around it? (I'm currently using the
>  2009 issue of detex but that's a bit messy.)
> - if there is a better way of getting document statistics?
>
> Specifically, I need word counts which are as accurate as possible. But
> if there is to be inaccuracy, it is generally better if the count is
> reported as slightly higher than it really is rather than lower because
> I'm typically trying to write stuff which does not exceed a given limit.
> This makes the current detex almost useless.
>
> I know detex is used for more than word counts but can't imagine what
> purpose is served by stripping out italic text, for example. Please,
> this isn't supposed to be a feature, is it? Please?!
>
> This is also intended to alert people who rely on TeXShop's statistics
> (or detex | wc) that the results may be unreliable with TeX Live 2010.
> Perhaps I missed it, but I don't recall seeing any warnings to this
> effect or information about changes to the current version of detex.
> (If anybody saw such and can send me a pointer, that'd be great.)
>
> Thanks,
> cfr
## Re: counting words in 2010

 On Thu 28th Oct, 2010 at 19:59, David Derbes seems to have written:

> I'm not sure if this is a solution, but Excalibur counts words as it spell-checks.

Hmm... I didn't know that. How accurate is it?

> There is also WordService from Devon Technologies that works with many programs; it has a word count feature.
> Free.
>
> http://www.devon-technologies.com/products/freeware/services.htm

I use this for paragraphs but it is no use where there's a lot of markup or for entire documents because it doesn't filter out the TeX stuff at all. But you are right that it is a very useful service to have installed.

Thanks,
cfr
## Re: counting words in 2010

 On Oct 28, 2010, at 4:56 PM, Dr. Clea F. Rees wrote:

> The problem I'm seeing is with /usr/texbin/detex as supplied with TeX
> Live 2010 as opposed to the versions supplied with TeX Live 2008 and
> 2009. Essentially, I'm getting much lower word counts than I should
> because detex is stripping out text which it really shouldn't. The
> things I'm certain about include footnote text and italicised text but
> I suspect these are just a part of the problem.

In testing some short example files, it seemed that the contents of \emph{}, \textbf{} and so on are being counted, but not the contents of \footnote{}. However, footnote text built with the construction \begin{footnote}{}\end{footnote} seemed to be correctly counted. The man page for detex spells out the limited customizations that may be applied to detex.

Michael
## Re: counting words in 2010

 Command-line tool texcount is great for this. I use the -sum flag, like

texcount -sum foo.tex

and I'm happy with the results. However, I don't honestly know whether it is subject to the complaint you have with detex.
## Re: counting words in 2010

 On 29.10.2010, at 06:11, C.H.E. wrote:

> Command-line tool texcount is great for this. I use the -sum flag, like
>
> texcount -sum foo.tex
>
> and I'm happy with the results. However, I don't honestly know whether it is
> subject to the complaint you have with detex.

I am using two engines for word counting that are based on texcount. They ship with TeXShop; see

/Applications/TeX/TeXShop.app/Contents/Resources/TeXShop/Engines/Inactive/texcount/

Ramon wrote AppleScripts:
http://www2.hawaii.edu/~ramonf/TeXShop/index.html

For me, texcount seems to be the most accurate of the tools available. And it is actively developed.

http://folk.uio.no/einarro/Comp/texwordcount.html

has 2.3alpha, MacTeX 2010 comes with 2.2

Daniel
## Re: counting words in 2010

 On 29.10.2010, at 08:15, Daniel Becker wrote:

>
> http://folk.uio.no/einarro/Comp/texwordcount.html
> has 2.3alpha, MacTeX 2010 comes with 2.2

The zip file there seems to be broken. The one from

http://folk.uio.no/einarro/TeXcount/download.html

can be unzipped.

Daniel
## Re: counting words in 2010

 On Thu 28th Oct, 2010 at 18:59, Michael Sharpe seems to have written:

>
> On Oct 28, 2010, at 4:56 PM, Dr. Clea F. Rees wrote:
>
>> The problem I'm seeing is with /usr/texbin/detex as supplied with TeX
>> Live 2010 as opposed to the versions supplied with TeX Live 2008 and
>> 2009. Essentially, I'm getting much lower word counts than I should
>> because detex is stripping out text which it really shouldn't. The
>> things I'm certain about include footnote text and italicised text but
>> I suspect these are just a part of the problem.
>
> In testing some short example files, it seemed that the contents of \emph{}, \textbf{} and so on are being counted, but not the contents of \footnote{}. However, footnote text built with the construction \begin{footnote}{}\end{footnote} seemed to be correctly counted. The man page for detex spells out the limited customizations that may be applied to detex.

Strange. detex is definitely taking out the contents of \emph{} here. If I run detex without piping through wc, I can see the relevant words are missing. (This is without passing detex any options.) Maybe we are using different versions of detex? Earlier versions definitely did not behave in this way.

Thanks,
cfr
## Re: counting words in 2010

 On Oct 30, 2010, at 12:46 PM, <[hidden email]> <[hidden email]> wrote:

> Strange. detex is definitely taking out the contents of \emph{} here.
> If I run detex without piping through wc, I can see the relevant words
> are missing. (This is without passing detex any options.) Maybe we are
> using different versions of detex? Earlier versions definitely did not
> behave in this way.

I'm using the intel 64-bit version of detex dated 7/13/10, though I get the same results if I use the universal 32-bit version of detex dated 6/16/10. In both cases, the detex output from the line

Test\footnote{ a footnote}  \emph{it} quickly.

is

Test  it quickly.

Michael
## Re: counting words in 2010

 On Sat 30th Oct, 2010 at 13:34, Michael Sharpe seems to have written:

>
> On Oct 30, 2010, at 12:46 PM, <[hidden email]> <[hidden email]> wrote:
>
>> Strange. detex is definitely taking out the contents of \emph{} here.
>> If I run detex without piping through wc, I can see the relevant words
>> are missing. (This is without passing detex any options.) Maybe we are
>> using different versions of detex? Earlier versions definitely did not
>> behave in this way.
>
> I'm using the intel 64-bit version of detex dated 7/13/10,
> though I get the same results if I use the universal 32-bit version of detex dated 6/16/10.

My version is dated 17/6/2010. I'm definitely using the universal 32-bit version or it wouldn't be working at all.

> In both cases, the detex output from the line
>
> Test\footnote{ a footnote}  \emph{it} quickly.
>
> is
>
> Test  it quickly.

I can't now reproduce the disappearance of \emph text. Not even using the same document. (I've edited it but not any of the TeX stuff.) However, when I put your sample sentence into a test file and run detex, I get:

  Test  a footnote  it quickly.

Footnotes are still being deleted from my paper when it is run through detex, though. So now I'm just confused and have no idea what's going on.

I get the following results:

detex 2010 + wc: 208    3088   19174
detex 2009 + wc: 192    3215   20003
detex 2008 + wc: 192    3215   20003

texcount:
  Words in text: 3115
  Words in headers: 17
  Words in float captions: 121
  Number of headers: 5
  Number of floats: 0
  Number of math inlines: 8
  Number of math displayed: 0

All of these are being run without any customisation. I'm not surprised to get a different result from texcount, of course. (Though I'd like to know which way of counting is most accurate!) But I'm curious about the different results from different versions of detex. The documentation doesn't seem to be any different. I downloaded and compiled the source and get the same results as those for 2010 above. That version is 2.8 but the documentation still refers to 2.6. So maybe changes made to 2.7 are responsible for the different results? (Versions 2.7 and 2.8 are identical aside from the licence, I believe.)

Thanks,
Clea
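The version comparison above can be repeated with a small loop. This is a sketch only: the function name compare_detex is invented here, and the binary paths passed to it are whatever your installations use (only the TeX Live 2010 path /usr/local/texlive/2010/bin/universal-darwin/detex is mentioned in this thread).

```shell
# compare_detex FILE BINARY...   (hypothetical helper)
# Run each detex binary on FILE and print the `wc` result (lines, words,
# characters) next to its path. Binaries that are missing or not
# executable are reported and skipped.
compare_detex() {
    file=$1
    shift
    for bin in "$@"; do
        if [ -x "$bin" ]; then
            # Squeeze the column padding that wc adds so the line is compact.
            printf '%s: %s\n' "$bin" "$("$bin" "$file" | wc | tr -s ' ' | sed 's/^ //')"
        else
            printf '%s: not found, skipped\n' "$bin"
        fi
    done
}
```

For example, `compare_detex paper.tex /usr/local/texlive/2010/bin/universal-darwin/detex /path/to/older/detex`, where paper.tex and the second path are placeholders.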
## Re: counting words in 2010

On 30.10.2010, at 21:46, <[hidden email]> wrote:

> Maybe we are using different versions of detex?

Maybe! I don't have the stable version from TL '10, but the detex versions from TL '08 'til '11 work OK, as expected.

Clea, can you describe exactly how you perform your tests?

-- 
Greetings

   Pete

No project was ever completed on time and within budget.
                                – Cheops Law
## Re: counting words in 2010

On Oct 30, 2010, at 5:21 PM, Peter Dyballa wrote:

>
> On 30.10.2010, at 21:46, <[hidden email]> wrote:
>
>> Maybe we are using different versions of detex?
>
> Maybe! I don't have the stable version from TL '10, but the detex versions from TL '08 'til '11 work OK, as expected.
>
> Clea, can you describe exactly how you perform your tests?
>
> --
> Greetings
>
>  Pete

Howdy,

You have the detex from TL'11? :-)

With TL-2010 if I run detex on

\documentclass{article}
\begin{document}
Hello World\footnote{ footnote}
\end{document}

I only get

Hello World

rather than the expected

Hello World footnote

isn't that what you get too?

Good Luck,

Herb Schulz
(herbs at wideopenwest dot com)

----------- Please Consult the Following Before Posting -----------
TeX FAQ: http://www.tex.ac.uk/faqList
Reminders and Etiquette: http://email.esm.psu.edu/mac-tex/
List Archive: http://tug.org/pipermail/macostex-archives/
TeX on Mac OS X Website: http://mactex-wiki.tug.org/
List Info: http://email.esm.psu.edu/mailman/listinfo/macosx-tex
## Re: counting words in 2010

On 31.10.2010, at 00:43, Herbert Schulz wrote:

> You have the detex from TL'11? :-)

Compiled from the sources...

> isn't that what you get too?

Well, knowing that Einstein was German (really? at least he spoke German!) and wanted things to be made simple, I test this simple way on the command line:

        echo "Please test\\footnote{\\textbf{this} footnote}, \\emph{but} quickly!"
        echo "Please test\\footnote{\\textbf{this} footnote}, \\emph{but} quickly!" | wc
        echo "Please test\\footnote{\\textbf{this} footnote}, \\emph{but} quickly!" | /usr/local/texlive/2010/bin/universal-darwin/detex | wc

In the last case I can use different versions of detex. The results from the three different command lines are:

        Please test\footnote{\textbf{this} footnote}, \emph{but} quickly!
               1       5      66
               1       6      40

The first line shows that the syntax chosen is OK, the second line counts the run-together words OK (the second figure; the first one is the number of lines, the last one that of the characters of the input line), and the last line is correctly filtered by detex. This command shows how it correctly filters:

        echo "Please test\\footnote{\\textbf{this} footnote}, \\emph{but} quickly!" | /usr/local/texlive/2010/bin/universal-darwin/detex
        Please test this footnote, but quickly!

Again, I do not have the official/ready/finished detex version of TL '10. Maybe this file is defective...

Well, which "detex" are you actually using? One via a TeXShop engine? Could you add to that engine file:

        echo -n "The detex programme soon to be used is certainly this one: " ; which detex

The output will appear in the console window, together with the word count.

Does the word count come closer to the expected value with a text body of

        Hello World\footnote{ footnote} and more

or

        Hello World\footnote{ footnote and more}

or

        Hello World\footnote{ footnote}!

--
Greetings

   Pete

November, n.:
        The eleventh twelfth of a weariness.
                – Ambrose Bierce, "The Devil's Dictionary"
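The probes used in this thread can be turned into a yes/no check on a given binary. A sketch, with some assumptions: the function name keeps_footnotes and the marker word "probe" are invented for illustration, and -l (force LaTeX mode, as in the -wl combination mentioned elsewhere in the thread) is used because the probe has no \documentclass line of its own.

```shell
# keeps_footnotes BINARY   (hypothetical helper)
# Exit status 0 if BINARY, run as a detex executable in LaTeX mode,
# leaves the footnote text of a one-line probe in its output; 1 if the
# text is dropped. The marker word "probe" only occurs inside the footnote.
keeps_footnotes() {
    printf '%s\n' 'Hello World\footnote{ probe}' | "$1" -l | grep -q 'probe'
}
```

For example, `keeps_footnotes "$(which detex)" && echo kept || echo dropped` reports on whichever detex is first in the PATH.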
## Re: counting words in 2010

On 31.10.2010, at 23:55, Michael Sharpe wrote:

> This does of course give an incorrect count because words were run
> together in LaTeX mode that were not run together in plain tex mode.
> I would consider this a bug in detex. (I'm using the x86-64 version
> that came with TeXLive 2010, but I get exactly the same result with
> detex from the 2008 distribution.)

I can confirm that detex from the test or pre-test versions of TL '10 and '11 shows the faulty behaviour you describe and Clea first found. I cannot confirm the behaviour in TL '08, for me it's correct. I have PPC hardware and PPC binaries, except for TL '10 pre, which are UB.

The same faulty behaviour appears when using -w (or -wt) and -wl: In the LaTeX case the footnote text is removed. (BTW, after the comma and before \\emph I had inserted a TAB. In your output Michael, you can see the SPACE character which is substituted for the footnote text, see later.)

The problem is also with the test file for detex:

        \documentclass{article}
        \begin{document}

        This is the first paragraph.

        \section{First Section}

        Preamble of Sect.~1.

        \subsection{A Subsection}

        Here some text, an inline formula $(a+b)^2=a^2+2ab+b^2$, as well
        as a displayed equation
        %

        e^{\pm ix}=\cos x \pm i \sin x\;,

        %
        and some more text.

        Now some verbatim text \verb|a b c|.  That's all, folks.

        \end{document}

There is no \footnote{} in it...

There is also another bug in detex, I think. I looked into the (F)LEX output, which has:

        "\\part"{Z} ;
        "\\section"{Z} ;
        "\\subsection"{Z} ;
        "\\subsubsection"{Z} ;
        "\\paragraph"{Z} ;
        "\\sunparagraph"{Z} ;

Well, I *can* understand why in November one thinks of the sun – but it's wrong!

Later this line comes:

        "\\footnote" {KILLARGS(1); SPACE;}

Here we have it: a footnote is not assumed to be counted...

Other bugs, a "d" too much in "and":

        ErrorExit("-e option requires and argument");

and too obvious:

        ErrorExit("The environtment list contains too many environments");

At daylight I'll send a message to the TeX Live list, I won't mention the footnote problem. Clea, are you going to send a message yourself?

--
Greetings

   Pete

November, n.:
        The eleventh twelfth of a weariness.
                – Ambrose Bierce, "The Devil's Dictionary"
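If the "\\footnote" {KILLARGS(1); SPACE;} rule is what discards the text, one conceivable workaround is to rewrite the input before detex sees it, so that the footnote text sits in an ordinary brace group. This is a sketch, not a tested fix: the function name unwrap_footnotes is invented here, it assumes detex keeps the contents of plain brace groups (as the \emph results in this thread suggest), and it ignores forms such as \footnote[1]{...} or \footnote followed by whitespace.

```shell
# unwrap_footnotes   (hypothetical pre-filter)
# Read LaTeX on stdin and write it to stdout with every "\footnote{"
# replaced by " {". The space keeps the footnote from running into the
# preceding word. "\footnotetext{" and "\footnotemark" are not touched.
unwrap_footnotes() {
    sed 's/\\footnote{/ {/g'
}
```

It would be used as `unwrap_footnotes < paper.tex | detex | wc -w`, with paper.tex as a placeholder. Counting footnote text pushes the total up rather than down, which errs on the side of an over-count when working to a limit.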
## Re: counting words in 2010

On Mon 1st Nov, 2010 at 01:02, Peter Dyballa seems to have written:

>
> On 31.10.2010, at 23:55, Michael Sharpe wrote:
>
>> This does of course give an incorrect count because words were run together
>> in LaTeX mode that were not run together in plain tex mode. I would
>> consider this a bug in detex. (I'm using the x86-64 version that came with
>> TeXLive 2010, but I get exactly the same result with detex from the 2008
>> distribution.)
>
>
> I can confirm that detex from the test or pre-test versions of TL '10 and '11
> shows the faulty behaviour you describe and Clea first found. I cannot
> confirm the behaviour in TL '08, for me it's correct. I have PPC hardware and
> PPC binaries, except for TL '10 pre, which are UB.

I also don't see the problem with TL '08. Similarly for TL '09. It first appears for me in TL '10. And I am also on PPC.

> The same faulty behaviour appears when using -w (or -wt) and -wl: In the
> LaTeX case the footnote text is removed. (BTW, after the comma and before
> \\emph I had inserted a TAB. In your output Michael, you can see the SPACE
> character which is substituted for the footnote text, see later.)
>
>
> The problem is also with the test file for detex:
>
> \documentclass{article}
> \begin{document}
>
> This is the first paragraph.
>
> \section{First Section}
>
> Preamble of Sect.~1.
>
> \subsection{A Subsection}
>
> Here some text, an inline formula $(a+b)^2=a^2+2ab+b^2$, as well
> as a displayed equation
> %
>
> e^{\pm ix}=\cos x \pm i \sin x\;,
>
> %
> and some more text.
>
> Now some verbatim text \verb|a b c|.  That's all, folks.
>
> \end{document}
>
> There is no \footnote{} in it...
>
>
> There is also another bug in detex, I think. I looked into the (F)LEX output,
> which has:
>
> "\\part"{Z} ;
> "\\section"{Z} ;
> "\\subsection"{Z} ;
> "\\subsubsection"{Z} ;
> "\\paragraph"{Z} ;
> "\\sunparagraph"{Z} ;
>
> Well, I *can* understand why in November one thinks of the sun – but it's
> wrong!
>
> Later this line comes:
>
> "\\footnote" {KILLARGS(1); SPACE;}
>
> Here we have it: a footnote is not assumed to be counted...
>
>
> Other bugs, a "d" too much in "and":
>
> ErrorExit("-e option requires and argument");
>
> and too obvious:
>
>    ErrorExit("The environtment list contains too many
> environments");
>
>
> At daylight I'll send a message to the TeX Live list, I won't mention the
> footnote problem. Clea, are you going to send a message yourself?

I can certainly do that.

Thanks,
Clea

> --
> Greetings
>
> Pete
>
> November, n.:
> The eleventh twelfth of a weariness.
> – Ambrose Bierce, "The Devil's Dictionary"