Friday, August 14, 2009

IPhOD v1.4 (correcting an error in v1.3)

I recently found that IPhOD version 1.3 (Feb 2005) contained an error in the positional probability measures: the average positional probability calculations (columns 39-44) came out 3-4 times higher than they should have been. As of August 14, 2009, the positional probability calculations have been corrected in Version 1.4 and updated throughout the IPhOD website, so whether you download a copy or search online, you will now get the corrected estimates of positional probability.

The good news is that the calculation error did not affect any of the other measures, and positional probability is not typically used independently of other measures.

Seems like this might be a good time to talk about positional probability a little more. Positional probability is a measure that is often used to help control or manipulate sublexical processing, along with biphoneme probability. The measure is calculated by counting the number of times that a phoneme occurs in a specific position, then dividing by the counts of all phonemes in that position. So in the word "cat", P(K,1), P(AE,2), and P(T,3) are the positional probabilities that must be determined. Once those are known, they are averaged to give a relative estimate of the typicality of a word's sounds in their respective positions.

In the example of "cat", P(K,1) = 0.094, P(AE,2) = 0.066, and P(T,3) = 0.062; so the average positional probability for the phonemes of "K.AE.T" is equal to 0.0739. Compare that to "hat", which has an average of 0.0551, and you can see that fewer words begin with H than K (since the other phonemes are identical and in the same positions).
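To make the calculation concrete, here is a minimal sketch in Python. It assumes a tiny made-up lexicon and simple unweighted counts; the names `TOY_LEXICON`, `positional_counts`, `positional_probability`, and `average_positional_probability` are hypothetical and are not part of IPhOD, and the numbers it produces are not the IPhOD values quoted above.

```python
from collections import Counter, defaultdict

# Hypothetical toy lexicon (NOT IPhOD data): word -> list of phoneme labels.
TOY_LEXICON = {
    "cat": ["K", "AE", "T"],
    "hat": ["HH", "AE", "T"],
    "kit": ["K", "IH", "T"],
    "cab": ["K", "AE", "B"],
    "at": ["AE", "T"],
}


def positional_counts(lexicon):
    """Count how often each phoneme occurs in each (1-based) position."""
    counts = defaultdict(Counter)
    for phonemes in lexicon.values():
        for position, phoneme in enumerate(phonemes, start=1):
            counts[position][phoneme] += 1
    return counts


def positional_probability(counts, phoneme, position):
    """P(phoneme, position): count of this phoneme in this position,
    divided by the count of all phonemes in this position."""
    total = sum(counts[position].values())
    return counts[position][phoneme] / total if total else 0.0


def average_positional_probability(counts, phonemes):
    """Average the positional probabilities over a word's phonemes."""
    probs = [
        positional_probability(counts, phoneme, position)
        for position, phoneme in enumerate(phonemes, start=1)
    ]
    return sum(probs) / len(probs)


counts = positional_counts(TOY_LEXICON)
cat_avg = average_positional_probability(counts, TOY_LEXICON["cat"])
hat_avg = average_positional_probability(counts, TOY_LEXICON["hat"])
print(f"cat: {cat_avg:.4f}  hat: {hat_avg:.4f}")
```

In this toy lexicon, "cat" again comes out higher than "hat", simply because more of the toy words begin with K than with HH. As a sanity check on the real example, averaging the rounded values quoted above gives (0.094 + 0.066 + 0.062) / 3 ≈ 0.074, which agrees with the reported 0.0739 up to rounding.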

Another interesting note is how the positional probabilities vary over phoneme positions. The figures below illustrate vowels versus consonants, and you can see that English words have a strong tendency to form CVC patterns at the onset of words. Looking at the red arrow, you can see the V-shaped consonant probabilities in positions 1-3 and the mountain-shaped vowel probability distribution.


Positional probability also tends to become highly variable in later positions, something I noted in the last blog entry. You can see this pattern clearly in the consonants figure, where consonant probabilities spike, on average, in the later positions. English words contain a lot of word-final consonants, which explains why the blue vowel line gradually decreases while the yellow consonant line rises until the final spike.

Finally, the last bit of news: Version 2 of IPhOD is coming soon. This will represent a major overhaul of the database, including a new word frequency measure to take the place of Kucera-Francis written word frequency. It will be fascinating to see how the phonotactic and density measures change, or how much of a difference there will be when KF is no longer the basis for our estimates. Importantly, we expect the new measures to be more powerful predictors of behavior and brain activity. Shouldn't be too long now!

Tuesday, August 4, 2009

Making Changes?

Recently, I have heard suggestions for changes to the database, and I wondered if these would have traction with readers. If you're interested in phonotactic research but this database is missing some piece you consider critical, then let me know. (I'll see if I can add a nifty polling device to this blog for the question too.)

Q. What changes should we make to the IPhOD? (Why?)


Here are a couple of ideas that have really stuck with me so far:

1. Change: Kucera & Francis (1967) to some other frequency metric, such as SUBTLEXus frequency (Brysbaert & New, in press) or a Google-based frequency (e.g., Blair, Urland, & Ma, 2002)?
Reason: KF is losing popularity in psycholinguistics as a measure of word frequency, it has a lot of baggage, and that makes KF-weighting more questionable. A switch could bring our frequency-weighted calculations up to date.
(RE: Mark Seidenberg comment on Talking Brains Blog)

2. Change: positional probability metric to be length constrained, as Vitevitch and Luce (2004) did for the version of the measure computed by their Online Phonotactic Calculator.
Reason: As words get longer, their average positional probability values start varying a lot. Since relatively few words of length 7-17 phonemes exist, probabilities are more variable in the later positions of long words. I would predict that this mainly affects longer words or pseudowords, but it could result in interesting changes for shorter items too.
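
One possible reading of "length constrained" is sketched below in Python: counts are kept separately for each word length, so a phoneme in position 1 of a 3-phoneme word is only compared against other 3-phoneme words. This is an illustrative assumption, not a description of how Vitevitch and Luce's calculator or IPhOD actually computes the measure; the toy word list and the function names are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical toy word list (NOT IPhOD data): each word is a list of phonemes.
TOY_WORDS = [
    ["K", "AE", "T"],       # cat
    ["HH", "AE", "T"],      # hat
    ["AE", "T"],            # at
    ["K", "AE", "T", "S"],  # cats
]


def length_constrained_counts(words):
    """Count phonemes per (word length, 1-based position) bucket."""
    counts = defaultdict(Counter)
    for phonemes in words:
        length = len(phonemes)
        for position, phoneme in enumerate(phonemes, start=1):
            counts[(length, position)][phoneme] += 1
    return counts


def length_constrained_probability(counts, phoneme, position, length):
    """P(phoneme, position) computed only over words of the given length."""
    bucket = counts[(length, position)]
    total = sum(bucket.values())
    return bucket[phoneme] / total if total else 0.0


lc_counts = length_constrained_counts(TOY_WORDS)
p_k1_len3 = length_constrained_probability(lc_counts, "K", 1, 3)
p_k1_len4 = length_constrained_probability(lc_counts, "K", 1, 4)
print(f"P(K,1 | length 3) = {p_k1_len3:.2f}")
print(f"P(K,1 | length 4) = {p_k1_len4:.2f}")
```

The toy list also shows the sparsity issue: with only one 4-phoneme word, every probability in that bucket rests on a single observation, which is the kind of variability that long words run into.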


References:

Brysbaert, M., & New, B. (in press). Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. BRM.

Blair, I. V., Urland, G. R., & Ma, J. E. (2002). Using Internet search engines to estimate word frequency. BRM 34, 286-290.

Vitevitch, M. S., & Luce, P. A. (2004). A Web-based interface to calculate phonotactic probability for words and nonwords in English. BRM 36(3), 481-487.