Tuesday, December 1, 2009

IPhOD 2

The sequel has arrived!

IPhOD version 2.0 can now be downloaded from iphod.com. I have improved the documentation on the website and I'm hoping to get version 2.0 search scripts and calculators up before the end of December. Following those final changes, the webpage and database should be stable for some time. If you have a chance, let me know what you think about the new organization of the webpage.


Here are the finer aspects of Version 2:

1. A better frequency measure.
I switched from Kucera-Francis to SUBTLEXus (Brysbaert & New), whose authors have done a marvelous job demonstrating the inadequacies of KF in explaining variance in psychological data. Furthermore, they have presented a good case that counts based on movie subtitles are a far superior option. I decided to include both the frequency measure and the context-dependent measure. The impact of changing the frequency measure is that all of the frequency-weighted measures will likely be altered by having fresher and improved written counts.

2. More words.
There are 54k words in version 2.0, up from 33k. This means that all sorts of search terms may come up when you browse its contents. However, it also means that the phonotactic and density calculations in IPhOD are now calculated on an even wider base. Since many of the probability measures are relative frequencies (counts divided by terms summed over all words, phoneme pairs, etc.), more words should improve the accuracy.

3. Homophones and Homographs.
They're both included in the new database. I treated them differently when counting words: for homophones, I summed their frequencies but counted them once for raw counts; for homographs, I counted every entry but added their (written) counts only once whenever they came up in counts.

4. Length-Constrained Positional Probability.
I have been contemplating this for a while, so I included a new measure that constrained the phoneme-positional counts to words of equal phoneme length. I'll have to explore this one later in more detail, but the intuition is that LCPP should be less variable across words of different phoneme lengths. Since length constraints group counts together, the effects of later positions could be smaller... However, my first peeks at the data found that those pesky regular CV structures and word-final consonant clusters probably will keep LC from being the final solution to the positional problem.
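To make the idea concrete, here is a minimal sketch of a length-constrained positional probability in Python. It is not the code used to build IPhOD: the toy lexicon, the function names, and the unweighted (raw count) calculation are all assumptions for illustration. The only change from an ordinary positional count is that every count is also keyed by the word's phoneme length.

```python
from collections import Counter

# Hypothetical toy lexicon of CMU-style transcriptions (not IPhOD data).
TOY_LEXICON = [
    ["K", "AE", "T"],       # cat
    ["HH", "AE", "T"],      # hat
    ["AE", "T"],            # at
    ["IH", "T"],            # it
    ["K", "AE", "T", "S"],  # cats
]

def lcpp_tables(transcriptions):
    """Count phonemes by position, separately for each word length."""
    counts = Counter()  # (length, position, phoneme) -> count
    totals = Counter()  # (length, position) -> count of all phonemes
    for phonemes in transcriptions:
        n = len(phonemes)
        for pos, ph in enumerate(phonemes, start=1):
            counts[(n, pos, ph)] += 1
            totals[(n, pos)] += 1
    return counts, totals

def average_lcpp(phonemes, counts, totals):
    """Average the length-constrained positional probabilities of one item."""
    n = len(phonemes)
    probs = []
    for pos, ph in enumerate(phonemes, start=1):
        total = totals[(n, pos)]
        # Unseen length/position combinations get probability 0.
        probs.append(counts[(n, pos, ph)] / total if total else 0.0)
    return sum(probs) / n if n else 0.0

counts, totals = lcpp_tables(TOY_LEXICON)
print(average_lcpp(["K", "AE", "T"], counts, totals))
print(average_lcpp(["AE", "T"], counts, totals))
```

In this sketch "cat" is only compared against the other three-phoneme items, so the four-phoneme "cats" no longer contributes to its counts.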

All of my excitement reflects just a few new files on the "Download" section. However, these changes should improve the ability of speech researchers (including me) to control and select words according to phonological factors. It will be terrific to find out if it really does!

Monday, November 23, 2009

the word or pseudoword "perved"

Don't worry! IPhOD has not been hacked again - the title alludes to a comment by an anonymous grad student who participated in one of my experiments recently. 

One item presented during the pseudoword detection task was pronounced perved, as if there were a past tense of the colloquial abbreviation for the noun: pervert. These sorts of pseudowords are always a hot item with subjects, since they may evoke any number of representations or processes in the course of making the tough call - did I just hear a word or a pseudoword? When you are using pseudowords that are generated to closely resemble English words, it is inevitable that subjects will hear slang or potential slang.

A brief debate followed the experiment. I argued with the participant that perved is definitely not an English word, plain and simple - you cannot apply past tense to a noun in this way. More importantly, we are not including pseudoword trials in the analysis - so the effect of such items is to challenge subjects and hopefully keep them engaged in the otherwise boring task.

However, it only took one hour in New York City to be proven wrong. Almost immediately after that conversation, I went out for lunch and along the way, a strange-looking man walked by me, making unusual faces and articulating some disturbing noises. I sped up my pace and got out of there! Based on this incident though, it seems that you can actually be perved by someone. While the pseudoword in question seems meaningless in Southern California, perved appears to be a legitimate, meaningful word in the Big Apple!

It is all about context, so choose your pseudowords carefully! Hopefully this guy wasn't my next subject...

Tuesday, November 17, 2009

hacked!

If you've visited IPhOD in the last 2 weeks or so, then you probably noticed that it looks pretty strange and bad. The company that hosts my webpage informed me by email that someone used a SQL hack to break into the webpage and trash it. The beautiful calculator functions that I wrote allowed the hacker to feed some code into the SQL server and then take over my site. This was terrible news since I had put so much time into the webpage and I am currently preparing a writeup. On the other hand ... this seemed like a good time to debut IPhOD version 2.0.

Since the new database is completely prepared and I must rebuild the webpage anyway, it makes sense to go ahead and distribute the newest version. The principal difference between version 1.4 and 2.0 is that I am using the SUBTLEX database word frequencies instead of Kucera-Francis word frequencies (the latter is widely seen as an inferior frequency measure). Another important change is the inclusion of homographs and homophones, and data columns that will reveal whether an entry has more than one pronunciation or spelling. I will be blogging a separate post on the changes when I release the newest version, so stay tuned!

IPhOD will continue to be available as a download with PERL scripts that I wrote to search its contents and calculate new values. I will try to get the calculators operating again -- but I have to patch up the vulnerability that the host identified. Since I am a programmer but not a computer scientist, this may take a while.
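For reference, the usual mitigation for this kind of attack is a parameterized (prepared) query, where user input is passed to the database separately from the SQL text. The sketch below only illustrates the idea in Python with the standard sqlite3 module. The table, its columns, and the lookup function are hypothetical, and the actual IPhOD calculators run on a different stack, so this is not the patch itself.

```python
import sqlite3

# Hypothetical in-memory table standing in for a searchable word list.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE words (spelling TEXT, phonemes TEXT)")
conn.executemany(
    "INSERT INTO words VALUES (?, ?)",
    [("cat", "K.AE.T"), ("hat", "HH.AE.T")],
)

def lookup(conn, user_input):
    """Look up a spelling submitted through a form.

    The ? placeholder sends user_input to the database as data only,
    so quotes or SQL keywords inside it are never executed as code.
    """
    cur = conn.execute(
        "SELECT spelling, phonemes FROM words WHERE spelling = ?",
        (user_input,),
    )
    return cur.fetchall()

print(lookup(conn, "cat"))             # an ordinary search term
print(lookup(conn, "cat' OR '1'='1"))  # an injection attempt matches nothing
```

The risky pattern is building the query by pasting the form text directly into the SQL string. With placeholders, the injection attempt is simply treated as an odd spelling that is not in the table.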

Your Opinion?

Here is where I ask for your opinions and tips: Is there anything that you think I should organize differently? If you are a user, are there any new features that you think would be valuable for searching the database? If you are a hacker or web programmer, is there a simple way to prevent SQL attacks from user-submitted forms? Any general comments are appreciated too!

I await your comments!

Friday, August 14, 2009

IPhOD v1.4 (correcting an error in v1.3)

I recently found that IPhOD version 1.3 (Feb 2005) contained errors in the positional probability measures; it resulted in average positional probability calculations (columns 39-44) being 3-4 times higher than they should have been. As of August 14, 2009, all of the positional probability calculations in Version 1.4 have been corrected and updated throughout the IPhOD website, so whether you download a copy or search online, you will now be getting the corrected estimate of positional probability.

The good news is that the calculation error did not affect any of the other measures, and positional probability is not typically used independently of other measures.

Seems like this might be a good time to talk about positional probability a little more. Positional probability is a measure that is often used to help control or manipulate sublexical processing, along with biphoneme probability. The measure is calculated by counting the number of times that a phoneme occurs in a specific position, then dividing by the counts of all phonemes in that position. So in the word "cat", P(K,1), P(AE,2), and P(T,3) are the positional probabilities that must be determined. Once those are known, they are averaged to give a relative estimate of the typicality of a word's sounds in their respective positions.

In the example of "cat", P(K,1) = 0.094, P(AE,2) = 0.066, and P(T,3) = 0.062; so the average positional probability for the phonemes of "K.AE.T" is equal to 0.0739. Compare that to "hat", which has an average of 0.0551, and you can see that fewer words begin with H than K (since the other phonemes are identical and in the same positions).
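The calculation can be sketched in a few lines of Python. This is a toy illustration, not the IPhOD code: the five-word lexicon and the function names are made up, the counts are unweighted, and the resulting numbers will not match the database values quoted above.

```python
from collections import Counter

# Hypothetical toy lexicon with CMU-style phonemes (not IPhOD data).
LEXICON = {
    "cat": ["K", "AE", "T"],
    "hat": ["HH", "AE", "T"],
    "kit": ["K", "IH", "T"],
    "cab": ["K", "AE", "B"],
    "at": ["AE", "T"],
}

def positional_counts(lexicon):
    """Count each phoneme in each position, plus all phonemes per position."""
    counts = Counter()  # (phoneme, position) -> count
    totals = Counter()  # position -> count of all phonemes in that position
    for phonemes in lexicon.values():
        for pos, ph in enumerate(phonemes, start=1):
            counts[(ph, pos)] += 1
            totals[pos] += 1
    return counts, totals

def average_positional_probability(phonemes, counts, totals):
    """Average P(phoneme, position) over the phonemes of one item."""
    probs = []
    for pos, ph in enumerate(phonemes, start=1):
        total = totals[pos]
        # Positions never seen in the lexicon get probability 0.
        probs.append(counts[(ph, pos)] / total if total else 0.0)
    return sum(probs) / len(probs) if probs else 0.0

counts, totals = positional_counts(LEXICON)
cat = average_positional_probability(LEXICON["cat"], counts, totals)
hat = average_positional_probability(LEXICON["hat"], counts, totals)
print(cat, hat)
```

Even in the toy lexicon, "cat" comes out higher than "hat" for the same reason as in the real data: the two items differ only in the first phoneme, and more entries begin with K than with HH.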

Another interesting note is how the positional probabilities vary over phoneme positions. In the figures below, I am illustrating vowels versus consonants - and you can see that English words have a huge tendency to form CVC-patterns at the onset of words. Looking at the red arrow, you can see the V-shaped consonant probabilities in positions 1-3 and the mountain-shaped vowel probabilities distribution.


Positional probability also tends to become highly variable in later positions, something I noted in the last blog entry. You can see this pattern clearly in the consonants figure, as consonants in the later positions spike - on average. English words contain a lot of word-final consonants, which explains why the blue vowel line is gradually decreasing and the yellow consonant line rises until the final spike.

Finally, the last bit of news: Version 2 of IPhOD is coming soon. This will represent a major overhaul of the database, including a new word frequency measure to take the place of Kucera-Francis written word frequency. It will be fascinating to see how the phonotactic and density measures change, or how much of a difference there will be when KF is no longer the basis for our estimates. Importantly, we expect the new measures to be more powerful predictors of behavior and brain activity. Shouldn't be too long now!

Tuesday, August 4, 2009

Making Changes?

Recently, I have heard suggestions for changes to the database and I wondered if this would have traction with readers. If you're interested in phonotactic research, but this database is missing some piece you consider critical - then let me know. (I'll see if I can add a nifty polling device to this blog for the question too.)

Q. What changes should we make to the IPhOD? (Why?)


Here are a couple of ideas that have really stuck with me so far:

1. Change: Kucera & Francis (1967) to some other frequency metric, such as SUBTLEXus frequency (Brysbaert & New, in press) or a Google-based frequency (e.g. Blair, Urland, & Ma, 2002)?
Reason: KF is losing popularity in psycholinguistics as a measure of word frequency, it has a lot of baggage, and that makes KF-weighting more questionable. A switch could bring our frequency-weighted calculations up to date.
(RE: Mark Seidenberg comment on Talking Brains Blog)

2. Change: positional probability metric to be length constrained, as Vitevitch and Luce (2004) did with that measure computed by their Online Phonotactic Calculator.
Reason: As words get longer, their average positional probability values start varying a lot. Since relatively few words of length 7-17 phonemes exist, probabilities are more variable in the later positions of long words. I would predict that this mainly affects longer words or pseudowords, but it could result in interesting changes for shorter items too.


References:

Brysbaert, M., New, B. (In Press). Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. BRM.

Blair, I. V., Urland, G. R., Ma, J. E. (2002). Using Internet search engines to estimate word frequency. BRM 34, 286-290.

Vitevitch, M. S., Luce, P. A. (2004). A Web-based interface to calculate phonotactic probability for words and nonwords in English. BRM 36(3), 481–487.

Tuesday, July 14, 2009

First dispatch from the IPhOD

Welcome to the IPhOD Blog.

The goal of IPhOD Blog is to facilitate informal discussion, questions, and suggestions for the Irvine Phonotactic Online Dictionary (IPhOD). This will also be a good tool to communicate with IPhOD users, should new developments occur with any of the tools available on the website.

I will be moderating to be sure that this doesn't turn into something totally different, but please feel free to post your comments here. In particular, I would like to hear ideas about phonological measures that would be useful in psycholinguistic experiments. Or maybe there is an alternative to this database that IPhOD's audience should know about; if so, I'd be happy to add that to this page or the main page. I welcome your comments.

And now for the IPhOD news:

1. I migrated servers from the original donated space to a far ritzier one. This was important for several reasons, but most importantly - I was able to bring back the online search utilities that originally made the site so appealing for psycholinguistic researchers. They are a lot more user-friendly than the downloadable PERL files and spreadsheets.

2. I reorganized the input screens to make them easier to manipulate and understand than the original search functions. I also corrected a very pesky output error that caused the search to fail to display unweighted neighborhood density, even if you clicked the button. If you used IPhOD 2-3 years ago, these changes will be immediately apparent. If not, you'll still like the changes, trust me.

3. I also programmed a new PHP-based online calculator that allows users to estimate unstressed neighborhood density and biphoneme probability for any transcriptions (in CMUPD glyphs) that the user enters. So if you have words or pseudowords that are not included in the IPhOD, then you may still be able to get the values that you need.
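For readers who want a feel for what the neighborhood density part of such a calculator does, here is a rough Python sketch. It is not the PHP calculator's code. It assumes the common convention that a neighbor differs from the target by one phoneme substitution, addition, or deletion, and it assumes "unstressed" means the stress digits on CMU vowels are ignored. The toy lexicon and function names are made up.

```python
import re

# Hypothetical toy lexicon in CMU-style glyphs with stress digits.
LEXICON = [
    ["K", "AE1", "T"],       # cat
    ["HH", "AE1", "T"],      # hat
    ["AE1", "T"],            # at
    ["K", "AE1", "T", "S"],  # cats
    ["K", "IH1", "T"],       # kit
    ["D", "AO1", "G"],       # dog
]

def strip_stress(phonemes):
    """Drop stress digits, e.g. AE1 -> AE (the 'unstressed' assumption)."""
    return tuple(re.sub(r"\d", "", ph) for ph in phonemes)

def is_neighbor(a, b):
    """True if a and b differ by exactly one phoneme edit."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # One substitution: exactly one position differs.
        return sum(x != y for x, y in zip(a, b)) == 1
    shorter, longer = (a, b) if len(a) < len(b) else (b, a)
    # One addition or deletion: removing one phoneme from the longer
    # form yields the shorter form.
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))

def neighborhood_density(target, lexicon):
    """Count distinct unstressed lexicon forms that neighbor the target."""
    t = strip_stress(target)
    forms = {strip_stress(p) for p in lexicon}
    return sum(is_neighbor(t, f) for f in forms)

print(neighborhood_density(["K", "AE1", "T"], LEXICON))  # a word in the lexicon
print(neighborhood_density(["G", "AE1", "T"], LEXICON))  # a pseudoword
```

Because the target is compared against the lexicon rather than looked up in it, the same function works for pseudowords that have no entry of their own.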

I look forward to sharing the latest developments with IPhOD users and hearing from you!

Cheers,

Kenny Vaden

Department of Cognitive Sciences
University of California, Irvine