Hairy Ticks of Dune

There's only room enough in this stillsuit for one of us! ... Wait, come back!

Sunday, February 11, 2007

Hunters Lexicography

Apologies for the unintended hiatus. Fortunately (or not, depending on how you view things), my role as "instructor of broods" will require minimal attention during the next moon or so, meaning I will have more time to devote to my side projects here!

I've been wanting to get back to the Celestia texture project, for one, and there still remains the thorough examination of Hunters. It will take a little time to get back into the proper frame of mind for both, but let's begin for the moment with a few lexicographical observations, shall we?

For example, the first 63 pages of Hunters (all I have scanned at the moment) contain 14,953 words. The total vocabulary (unique word-forms) is only 3,244 words. Ranked by number of occurrences (frequency in the text), the top 20 words are the following:

Rank   Count  Word  Share  Cumulative
3      371    TO    0.02   0.12
Total  4275   (~30% of total, 14953)

(If there are wide gaps of blank space before and after the table above...I think we may be seeing here another of those reasons why Blogger sucks? I'll fix the CSS when I have time. Futz. If not...ignore this!)

Now, that's not incredibly interesting, is it? Basically just a bunch of grammatical function words or other low-content words. And not all that unusual, either. Later I'll do the stats for a similar quantity of text from the beginning of Dune, just for comparison, but I don't expect them to be very different.

The first word that marks this as a Dune-related text doesn't come until the 28th rank: SHEEANA, occurring 65 times (0.43% of the total) or about once per page. Next, at 31st, is HONORED (as in "Matre") with 58 tokens (0.39%), followed at 35th by DUNCAN (55, 0.37%), at 36th by BENE (54, 0.36%), and at 40th by MURBELLA (50, 0.33%). MATRES ties with the verb form IS with 44 occurrences each (0.29%). ALL, MOTHER, NO-SHIP, and OLD tie with 38 occurrences. GESSERIT beats out TLEILAXU by just 3 tokens (36 times). SPICE, REBECCA, UXTAL, CHAPTERHOUSE, RABBI, and TEG all appear before the 100th top-ranking word, which is SCATTERING (22 tokens). The first 100 top-ranked words account, cumulatively, for 50% of the total text.
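For anyone who wants to reproduce a ranking like this, here is a minimal Python sketch. It is not necessarily how the numbers above were produced. The tokenizer is an assumption (uppercase everything, keep internal hyphens and apostrophes so that NO-SHIP counts as one word-form), and the names `tokenize` and `rank_table` are made up for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumed tokenizer: uppercase the text, then keep runs of letters,
    # allowing internal hyphens/apostrophes ("NO-SHIP" is one word-form).
    return re.findall(r"[A-Z]+(?:[-'][A-Z]+)*", text.upper())

def rank_table(tokens, top_n=20):
    # Rows of (rank, count, word, share of text, cumulative share of text).
    counts = Counter(tokens)
    total = len(tokens)
    rows = []
    cumulative = 0
    for rank, (word, n) in enumerate(counts.most_common(top_n), start=1):
        cumulative += n
        rows.append((rank, n, word, n / total, cumulative / total))
    return rows

# Toy sample text, not taken from the book.
sample = "The no-ship fled. The Honored Matres hunted the no-ship."
tokens = tokenize(sample)
for rank, n, word, share, cum in rank_table(tokens, top_n=3):
    print(f"{rank:>2} {n:>4} {word:<10} {share:.2f} {cum:.2f}")
```

Ties are left in whatever order `Counter.most_common` returns them, so tied words (like MATRES and IS above) would need an explicit rule if their relative rank matters.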

A grand total of 2,042 words (63% of the total vocabulary of 3,244) are used only once. This is only about 14% of the total text, however, and again, not that unusual for an extract of this size.
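Words used only once are easy to count from the same frequency table. The sketch below is a rough illustration with a hypothetical helper name (`hapax_stats`) and a toy token list. The last two lines simply redo the arithmetic on the figures quoted above (2,042 single-use words, 3,244 word-forms, 14,953 words).

```python
from collections import Counter

def hapax_stats(tokens):
    # Count word-forms that occur exactly once and express them as a
    # share of the vocabulary (types) and of the running text (tokens).
    counts = Counter(tokens)
    hapaxes = [word for word, n in counts.items() if n == 1]
    return {
        "tokens": len(tokens),
        "types": len(counts),
        "hapaxes": len(hapaxes),
        "share_of_vocabulary": len(hapaxes) / len(counts),
        "share_of_text": len(hapaxes) / len(tokens),
    }

# Toy token list, not taken from the book.
print(hapax_stats(["SPICE", "MUST", "FLOW", "SPICE"]))

# Arithmetic on the figures quoted in the post.
print(round(2042 / 3244 * 100, 1))   # single-use words as % of vocabulary
print(round(2042 / 14953 * 100, 1))  # single-use words as % of total text
```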

These are just some of the results from a quick analysis on just an initial part of the text. It's interesting to look at the frequencies of the words used, but other than the vocabulary (here, again, the total number of unique word forms used in the text), none of this information is terribly useful in evaluating the text. Some of the things I will be looking at once I have the entire book in analyzable format will include (but not be limited to):

  • composition of higher level elements (sections & paragraphs)
  • sentence length stats
  • sentential & phrasal structural complexity
  • anaphoric expressions
  • repetition (restatement & literal)
And I'll even try to make it all interesting...not just bludgeon you with it like this time! :)
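As a taste of the sentence length stats, here is a rough Python sketch. It assumes a naive sentence splitter (break after `.`, `!` or `?` followed by whitespace), which abbreviations and quoted dialogue will trip up, and the function name `sentence_lengths` is hypothetical.

```python
import re
from statistics import mean, median

def sentence_lengths(text):
    # Naive assumption: a sentence ends at ., ! or ? followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    # Length = number of word-forms, with internal hyphens/apostrophes kept.
    return [len(re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", s)) for s in sentences]

# Toy sample text, not taken from the book.
sample = "The no-ship fled. Where could it go? Nobody knew!"
lengths = sentence_lengths(sample)
print(lengths, mean(lengths), median(lengths))
```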


Anonymous said...

You'll have your work cut out for you in the category of "repetition" :)

11:25 AM, February 13, 2007 
SandChigger said...

I wonder if it would be worthwhile to first categorize: repetition of things the reader should know from past books and repetition of things from just several chapters before.

(DAMN but that Bell was fat, huh?)

4:24 PM, February 13, 2007 
