Tuesday, December 1, 2009


The sequel has arrived!

IPhOD version 2.0 can now be downloaded from iphod.com. I have improved the documentation on the website, and I'm hoping to get the version 2.0 search scripts and calculators up before the end of December. Once those final changes are in, the website and database should be stable for some time. If you have a chance, let me know what you think of the site's new organization.

Here are the finer aspects of Version 2:

1. A better frequency measure.
I switched from Kucera-Francis (KF) to SUBTLEXus (Brysbaert & New), whose authors have done a marvelous job demonstrating the inadequacies of KF in explaining variance in psychological data. Furthermore, they have made a good case that counts based on movie subtitles are a far superior option. I decided to include both the frequency measure and the context-dependent measure. The upshot is that all of the frequency-weighted measures will likely change, since they now rest on fresher and improved written counts.

2. More words.
There are 54k words in version 2.0, up from 33k. This means that all sorts of words may come up when you browse its contents. More importantly, it also means that the phonotactic and density calculations in IPhOD are now computed over an even wider base. Since many of the probability measures are relative frequencies (counts divided by totals summed over all words, phoneme pairs, etc.), more words should improve their accuracy.
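To make the relative-frequency point concrete, here is a minimal Python sketch. It is not the IPhOD code: the toy lexicon, the function name `biphone_probabilities`, and the use of raw frequency as the weight are all assumptions made for illustration. It only shows why the denominator, and therefore every probability, depends on how many words are in the base.

```python
from collections import defaultdict

# Toy lexicon of (phoneme sequence, frequency) pairs. All values are made up.
LEXICON = [
    (("K", "AE", "T"), 50.0),
    (("K", "AE", "B"), 10.0),
    (("B", "AE", "T"), 40.0),
]


def biphone_probabilities(lexicon, weighted=True):
    """Relative frequency of each adjacent phoneme pair over the whole lexicon.

    Assumption: with weighted=True each word contributes its raw frequency,
    otherwise each word contributes 1.
    """
    totals = defaultdict(float)
    for phonemes, freq in lexicon:
        weight = freq if weighted else 1.0
        for pair in zip(phonemes, phonemes[1:]):
            totals[pair] += weight
    # The denominator is summed over every pair in every word, so adding
    # words to the lexicon changes all of the resulting probabilities.
    grand_total = sum(totals.values())
    return {pair: value / grand_total for pair, value in totals.items()}


weighted_probs = biphone_probabilities(LEXICON, weighted=True)
unweighted_probs = biphone_probabilities(LEXICON, weighted=False)
```

With this toy data the pair K-AE gets 0.3 when weighted by frequency and 1/3 when every word counts once, so the choice of frequency measure moves the weighted values even when the word list stays the same.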

3. Homophones and Homographs.
They're both included in the new database. I treated them differently when counting words: homophones have their frequencies summed but are counted once in raw counts, while every homograph is counted, with its (written) count added once whenever it comes up in a count.
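One possible reading of that counting rule can be sketched in Python. This is an interpretation, not the procedure actually used to build the database: the entries, the function name `collapse_by_pronunciation`, and the decision to give each homograph pronunciation the full written count of its spelling are assumptions.

```python
from collections import defaultdict

# Toy entries: (spelling, pronunciation, written frequency of the spelling).
# All values are made up for illustration.
ENTRIES = [
    ("pair", ("P", "EH", "R"), 30.0),  # homophone of "pear"
    ("pear", ("P", "EH", "R"), 5.0),
    ("lead", ("L", "IY", "D"), 20.0),  # homograph: one spelling, two pronunciations
    ("lead", ("L", "EH", "D"), 20.0),
]


def collapse_by_pronunciation(entries):
    """Map each distinct pronunciation to its summed written frequency.

    Homophones share a pronunciation, so they merge into one key and their
    frequencies add up. Homographs have different pronunciations, so each
    stays a separate key carrying the written count of the shared spelling
    (an assumption: a written count cannot be split by pronunciation).
    """
    freq_by_pronunciation = defaultdict(float)
    for _spelling, pronunciation, freq in entries:
        freq_by_pronunciation[pronunciation] += freq
    return dict(freq_by_pronunciation)


collapsed = collapse_by_pronunciation(ENTRIES)
raw_count = len(collapsed)  # each distinct pronunciation counts once
```

Under these assumptions "pair" and "pear" become a single entry with frequency 35, the two pronunciations of "lead" remain two entries with frequency 20 each, and the raw count is 3.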

4. Length-Constrained Positional Probability.
I have been contemplating this for a while, so I included a new measure that constrains the phoneme-positional counts to words of equal phoneme length. I'll have to explore this one in more detail later, but the intuition is that LCPP should be less variable across words of different phoneme lengths. Since length constraints group counts together, the effects of later positions could be smaller. However, my first peeks at the data suggest that those pesky regular CV structures and word-final consonant clusters will probably keep length constraints from being the final solution to the positional problem.
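The difference between an ordinary positional probability and a length-constrained one can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the IPhOD calculation: the toy lexicon, the function name `positional_probabilities`, and the raw-frequency weighting are hypothetical.

```python
from collections import defaultdict

# Toy lexicon of (phoneme sequence, frequency) pairs. All values are made up.
LEXICON = [
    (("K", "AE", "T"), 50.0),
    (("B", "AE", "T"), 40.0),
    (("S", "T", "AA", "P"), 10.0),
]


def positional_probabilities(lexicon, length_constrained=False):
    """Frequency-weighted probability of each phoneme at each position.

    When length_constrained is True, counts and totals are kept separately
    for each word length, so a word is only compared with words that have
    the same number of phonemes. Keys are (length or None, position, phoneme).
    """
    counts = defaultdict(float)
    totals = defaultdict(float)
    for phonemes, freq in lexicon:
        group = len(phonemes) if length_constrained else None
        for position, phoneme in enumerate(phonemes):
            counts[(group, position, phoneme)] += freq
            totals[(group, position)] += freq
    return {
        key: value / totals[(key[0], key[1])]
        for key, value in counts.items()
    }


plain = positional_probabilities(LEXICON)
constrained = positional_probabilities(LEXICON, length_constrained=True)
```

In this toy lexicon, word-initial K has probability 0.5 over all words but 50/90 among three-phoneme words, and T in the third position goes from 0.9 to 1.0 once the four-phoneme word is excluded. That last value is the word-final effect mentioned above: within a length group, the final position is always a word ending.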

All of my excitement comes down to just a few new files in the "Download" section. However, these changes should improve the ability of speech researchers (including me) to control and select words according to phonological factors. It will be terrific to find out if they really do!