Hairy Ticks of Dune

posted by SandChigger | 7:12 AM

Apologies for the unintended hiatus. Fortunately (or not, depending on how you view things), my role as "instructor of broods" will require minimal attention during the next moon or so, meaning I will have more time to devote to my side projects here!

I've been wanting to get back to the Celestia texture project, for one, and there still remains the thorough bash...er, examination of Hunters. It will take a little time to get back into the proper frame of mind for both, but let's begin for the moment with a few lexicographical observations, shall we?

For example, the first 63 pages of Hunters (all I have scanned at the moment) contain 14,953 words. The total vocabulary (unique word-forms) is only 3,244 words. Ranked by number of occurrences (frequency in the text), the top 20 words are the following:

Rank	Occurrences	Word	Share of total	Cumulative share
1	1040	THE	0.07	0.07
2	386	OF	0.03	0.10
3	371	TO	0.02	0.12
4	341	AND	0.02	0.14
5	297	A	0.02	0.16
6	212	HAD	0.01	0.18
7	181	IN	0.01	0.19
8	157	HE	0.01	0.20
9	138	HIS	0.01	0.21
10	133	HER	0.01	0.22
11	127	THAT	0.01	0.23
12	117	AS	0.01	0.23
13	116	FROM	0.01	0.24
14	107	WAS	0.01	0.25
15	106	WITH	0.01	0.26
16	100	THEY	0.01	0.26
17	91	THEIR	0.01	0.27
18	88	ON	0.01	0.27
19	84	YOU	0.01	0.28
20	83	NOT	0.01	0.29
Total	4275		~29% of total (14,953)
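
For anyone who wants to build a table like the one above for their own text, here is a minimal sketch in Python. It is not necessarily how the numbers in this post were produced: the tokenizing rule (runs of letters, with internal hyphens or apostrophes kept so that NO-SHIP counts as one word) and the names `tokenize` and `frequency_table` are assumptions made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    # Assumption: a word-form is a run of letters, optionally joined by
    # internal hyphens or apostrophes (so NO-SHIP stays one word).
    return re.findall(r"[A-Za-z]+(?:[-'][A-Za-z]+)*", text.upper())

def frequency_table(text, top_n=20):
    """Return (total tokens, vocabulary size, rows) for the top_n word-forms.

    Each row is (rank, occurrences, word, share of total, cumulative share).
    """
    tokens = tokenize(text)
    total = len(tokens)
    counts = Counter(tokens)
    rows = []
    cumulative = 0
    for rank, (word, n) in enumerate(counts.most_common(top_n), start=1):
        cumulative += n
        rows.append((rank, n, word, n / total, cumulative / total))
    return total, len(counts), rows

# Tiny made-up sample, just to show the output shape.
sample = "The worm and the spice and the no-ship of the desert"
total, vocab, rows = frequency_table(sample, top_n=3)
for rank, n, word, share, cum in rows:
    print(f"{rank}\t{n}\t{word}\t{share:.2f}\t{cum:.2f}")
```

A different tokenizing rule (splitting on hyphens, say, or keeping digits) would shift both the total word count and the vocabulary size, so counts from two tools are only comparable if they agree on what a "word" is.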

(If there are wide gaps of blank space before and after the table above...I think we may be seeing here another of those reasons why Blogger sucks? I'll fix the CSS when I have time. Futz. If not...ignore this!)

Now, that's not incredibly interesting, is it? Basically just a bunch of grammatical function or other low-content words. And not all that unusual, either. Later I'll do the stats for a similar quantity of text from the beginning of Dune, just for comparison, but I don't expect it to be very different.

The first word that marks this as a Dune-related text doesn't come until the 28th rank: SHEEANA, occurring 65 times (0.43% of the total) or about once per page. Next, at 31st is HONORED (as in "Matre") with 58 tokens (0.39%), followed at 35th by DUNCAN (55, 0.37%), at 36th by BENE (54, 0.36%), and at 40th by MURBELLA (50, 0.33%). MATRES ties with the verb form IS with 44 occurrences each (0.29%). ALL, MOTHER, NO-SHIP, and OLD tie with 38 occurrences. GESSERIT beats out TLEILAXU by just 3 tokens (36 times). SPICE, REBECCA, UXTAL, CHAPTERHOUSE, RABBI, and TEG all appear before the 100th top-ranking word, which is SCATTERING (22 tokens). The first 100 top-ranked words account, cumulatively, for 50% of the total text.
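
The two kinds of figures in the paragraph above (a single word's share of the text, and how many top-ranked words it takes to cover half of it) can be sketched as follows. This is an illustrative sketch only; `share_of_text` and `words_needed_for_coverage` are hypothetical helper names, and the token list is assumed to be already split into upper-case word-forms.

```python
from collections import Counter

def share_of_text(tokens, word):
    """Fraction of all tokens that are `word` (e.g. 65 / 14953 for SHEEANA)."""
    return Counter(tokens)[word] / len(tokens)

def words_needed_for_coverage(tokens, target=0.5):
    """How many top-ranked word-forms are needed to cover `target` of the text."""
    total = len(tokens)
    covered = 0
    for k, (_, n) in enumerate(Counter(tokens).most_common(), start=1):
        covered += n
        if covered / total >= target:
            return k
    return 0  # only reached for an empty token list

# Made-up token list: 5 x THE, 3 x OF, 2 x SHEEANA.
tokens = ["THE"] * 5 + ["OF"] * 3 + ["SHEEANA"] * 2
print(share_of_text(tokens, "SHEEANA"))
print(words_needed_for_coverage(tokens, target=0.75))
```

As a check against the post's own numbers, 65 / 14,953 is about 0.0043, which matches the 0.43% quoted for SHEEANA.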

A grand total of 2,042 words (63% of the total vocabulary of 3,244) are used only once. These once-only words account for only about 14% of the total text (2,042 of 14,953 words), however, and again, that's not unusual for an extract of this size.
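
Counting the once-only words is straightforward once the text is tokenized. A minimal sketch, assuming the tokens are already normalized word-forms; `hapax_stats` is a hypothetical name, not a function from any particular tool:

```python
from collections import Counter

def hapax_stats(tokens):
    """Count word-forms that occur exactly once and report their shares."""
    counts = Counter(tokens)
    hapaxes = [word for word, n in counts.items() if n == 1]
    return {
        "hapaxes": len(hapaxes),
        # Share of the vocabulary (unique word-forms) used only once.
        "share_of_vocabulary": len(hapaxes) / len(counts),
        # Share of the running text made up of those once-only words.
        "share_of_text": len(hapaxes) / len(tokens),
    }

# Made-up example: "must" and "flow" each occur once.
stats = hapax_stats("the spice must flow the spice".split())
print(stats)
```

The first ratio corresponds to the 63% figure above (2,042 / 3,244 ≈ 0.63); the second divides the same count by the total number of words in the text instead.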

These are just some of the results from a quick analysis of just the initial part of the text. It's interesting to look at the frequencies of the words used, but other than the vocabulary (here, again, the total number of unique word forms used in the text), none of this information is terribly useful in evaluating the text. Some of the things I will be looking at once I have the entire book in analyzable format will include (but not be limited to):

composition of higher level elements (sections & paragraphs)
sentence length stats
sentential & phrasal structural complexity
anaphoric expressions
repetition (restatement & literal)

And I'll even try to make it all interesting...not just bludgeon you with it like this time! :)


Sunday, February 11, 2007

Hunters Lexicography
