Wednesday, March 28, 2012

New IPhOD v2.0 Calculator

How do you obtain the phonotactic frequency or neighborhood density of a word that is not included in the latest version of IPhOD, or obtain a list of its phonological neighbors that are labelled according to homophones or homographs?

Try the new IPhOD version 2.0 online calculator! The added feature calculates word-averaged phonotactic probabilities, neighborhood densities, and positional probabilities based on user-entered CMU transcriptions (Weide, 1994). This addresses an earlier limitation of the IPhOD version 2.0: only being able to search among words included in the ~50,000 word collection (or for a preset list of pseudowords). Prior to this update, if you needed to calculate values for new word or pseudoword transcriptions then you had to use values from version 1.4, which used different frequency weightings and a smaller word collection than 2.0.

The new calculator based on the IPhOD sequel also includes more options. The IPhOD version 2.0 calculator webpage (navigate to Calculator, then Version 2.0, or click here) allows the user to enter a list of CMU transcriptions (e.g. "K.AE.T" for "cat"), which can be used to generate any of the following values that the user selects:

1-4. Neighborhood density,
5-8. Biphoneme probability,
9-12. Triphoneme probability,
13-16. Positional probability,
17-20. Length-constrained probability.

Each of those selections comes with four options for weighted calculations: unweighted raw counts, SUBTLEXus frequency weighted (S Freq), log10 S Freq, and S Context-Dependent Counts. The word frequency weights are from SUBTLEXus (Brysbaert & New, 2009). At this time, I do not have plans to include syllable stress constraints in the calculator.
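As a rough illustration of how the weighting options can change a phonotactic value, here is a minimal Python sketch. It is not the calculator's actual code: the toy lexicon, the frequency numbers, the function names, and the "+1" inside the log are all assumptions made for the example. It treats a biphoneme probability as the weighted relative frequency of each adjacent phoneme pair and averages those values over the word.

```python
import math
from collections import defaultdict

# Hypothetical toy lexicon: CMU-style transcription -> made-up frequency count.
LEXICON = {
    "K.AE.T": 100.0,
    "K.AE.B": 10.0,
    "B.AE.T": 50.0,
    "K.IH.T": 5.0,
}

def weight(freq, scheme):
    """Return the weight a word contributes under a given scheme."""
    if scheme == "raw":
        return 1.0                      # unweighted: every word counts once
    if scheme == "freq":
        return freq                     # frequency weighted
    if scheme == "log10":
        return math.log10(freq + 1.0)   # the +1 is an assumption, not IPhOD's rule
    raise ValueError(f"unknown scheme: {scheme}")

def biphone_table(lexicon, scheme):
    """Weighted relative frequency of every adjacent phoneme pair."""
    counts = defaultdict(float)
    total = 0.0
    for transcription, freq in lexicon.items():
        phones = transcription.split(".")
        w = weight(freq, scheme)
        for pair in zip(phones, phones[1:]):
            counts[pair] += w
            total += w
    return {pair: c / total for pair, c in counts.items()}

def word_average(transcription, table):
    """Average the pair probabilities over one (pseudo)word."""
    phones = transcription.split(".")
    pairs = list(zip(phones, phones[1:]))
    if not pairs:
        return 0.0
    return sum(table.get(p, 0.0) for p in pairs) / len(pairs)
```

Because the table is built from the lexicon and then applied to any transcription, a pseudoword such as "B.AE.B" can be scored even though it is not an entry itself, which is the point of having a calculator in addition to a search.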


Finally, if you click on the "Show neighbors?" button before running the calculator, then the calculator produces a cool-looking second table that appears below the traditional calculator output.

The table contains every phonological neighbor that differs from the entry by one phoneme, with asterisks denoting words that have homographs or homophones. Although the calculator adjusts to avoid double-weighting words whose spellings are identical, and counts a set of homophone neighbors as only one neighbor (since they are indistinguishable by sound), it may be helpful for users to have that information available.
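For readers who want to see the idea in code, here is a minimal Python sketch of a one-phoneme neighbor search. It assumes the common definition of a neighbor (one substitution, addition, or deletion of a phoneme), and the toy word list and function names are hypothetical, not the calculator's own. Neighbors are grouped by transcription, so homophones count once toward density but are still listed and starred.

```python
from collections import defaultdict

# Hypothetical toy lexicon of (spelling, CMU-style transcription) pairs.
TOY_LEXICON = [
    ("cat", "K.AE.T"),
    ("cap", "K.AE.P"),
    ("mat", "M.AE.T"),
    ("matt", "M.AE.T"),   # homophone of "mat"
    ("at", "AE.T"),
    ("cats", "K.AE.T.S"),
    ("dog", "D.AO.G"),
]

def is_neighbor(a, b):
    """True if phoneme lists a and b differ by exactly one
    substitution, addition, or deletion."""
    if a == b:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if abs(len(a) - len(b)) != 1:
        return False
    longer, shorter = (a, b) if len(a) > len(b) else (b, a)
    # Dropping one phoneme from the longer form must give the shorter form.
    return any(longer[:i] + longer[i + 1:] == shorter
               for i in range(len(longer)))

def find_neighbors(target, lexicon):
    """Map each neighboring transcription to its sorted spellings."""
    target_phones = target.split(".")
    grouped = defaultdict(list)
    for spelling, transcription in lexicon:
        if is_neighbor(target_phones, transcription.split(".")):
            grouped[transcription].append(spelling)
    return {t: sorted(s) for t, s in grouped.items()}

def neighbor_rows(neighbors):
    """One text row per neighbor; '*' marks transcriptions shared by
    more than one spelling (homophones)."""
    rows = []
    for transcription in sorted(neighbors):
        spellings = neighbors[transcription]
        mark = "*" if len(spellings) > 1 else ""
        rows.append(f"{transcription}{mark}\t{', '.join(spellings)}")
    return rows
```

In this sketch the neighborhood density of a target is simply len(find_neighbors(target, TOY_LEXICON)), so "mat" and "matt" together add one to the density of "K.AE.T" rather than two.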

Pretty neat, huh?

If you start fiddling with the new features and break something, have questions, or discover a disabled button, please contact me. I have done my best to quality-check the new online calculator, primarily by comparing output from the online search with output from the online calculator (v 2.0).

Also, feel free to send me your suggestions for future extensions. The show neighbors function was a product of this blog, which has been quite an interesting and helpful development for me too.

I hope that these new features will be a helpful resource for other researchers!

Brysbaert, M. and New, B. 2009. Moving beyond Kucera and Francis: a critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977-990.

Weide, Robert L. 1994. CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

Saturday, February 18, 2012

Word Associations Around the World

Science needs your help! It takes 5 minutes, so you don't even have to log off Facebook!

I heard about an ongoing study that needs English speakers to briefly volunteer for a language experiment on the Internet: http://www.smallworldofwords.com/tests. The word association task only takes 5 minutes and involves naming words that are related to a few words that you are presented with.

The goal of the webpage by Dr. Gert Storms and Dr. Simon De Deyne is to collect word association data from thousands of English-speaking adults around the world in order to help explain word meaning and semantic relationships among words. The data collected from this Internet experiment will also provide a resource to cognitive scientists and other language researchers.

Saturday, November 19, 2011

English non-word, German word



This is a picture from a 2008 trip to Switzerland, where an ad meaninglessly (to me) stated: "bleib cool, mann". These were all over the place, which is amusing when you don't know exactly what is being advertised.

Since then, I have accepted the straightforward Internet translations of "stay cool, man" (or "man that stayed cool"?). Either way, one of my favorite IPhOD pseudowords ("BLEEB") is quite similar to a high frequency German word. English non-words in your experiments can appear as meaningful words elsewhere in the world (or in German classes). This may be an unavoidable issue in using pseudowords in research, but there are things that an experimenter can do to identify items such as these. Get to know your participants, ask about language background, and take careful notes. (e.g. How many of your English-speaking students who apparently had superior pseudoword recall were taking language classes?)

Monday, January 10, 2011

Searching IPhOD 2.0 Online

As of January 11, 2011, the newest release of IPhOD (version 2.0) can be searched online. Most users that I've communicated with prefer the online search to an offline manipulation of text files, so this should encourage researchers to begin making the switch from version 1.4. The search webpages and utilities were modified minimally, so if you've already learned to search the database then the newest version of IPhOD is a snap to explore.

Online searches that were performed on IPhOD prior to January 11, 2011 used version 1.4. I preserved the search utilities for the previous release (1.4), so if you're halfway finished with your experimental stimuli - have no worries! In the future, I would suggest that users note which version (2.0, 1.4, etc.) was used in their research.

As introduced previously, the latest release (2.0) contains several new measures, including a new word frequency metric that is more reliable than the one used in 1.4. There are more real words, and I included homographs and homophones in this version. Although IPhOD 2.0 expanded the options and the word list, many calculations are globally similar across versions (e.g. when considering the 30K+ shared words) in terms of measured associations between old and new values. The raw count based calculations are highly similar between versions, while the frequency-weighted values deviate somewhat - as would be expected based on improved word counts. Interestingly, there are strong associations between the log KF (v1.4) and log10 SF (v2.0) weighted values, indicating that the normalizing effect of log-based transformations might largely erase ~50 years of change in word usage, within limits.

One drawback to using IPhOD 2.0: the online calculator tool currently only uses base values from IPhOD 1.4. If you cannot find a word or pseudoword while searching IPhOD 2.0, then you cannot spell it out in glyphs to obtain version 2.0 value estimates. While that limits your options, version 2.0 contains over 50,000 entries, so my priority was to make the online search tool available first. The next project is to adapt my calculator code to generate version 2.0 based values.

My hope is that these expanded search tools will aid speech research and applications. As always, I welcome your feedback. I'll try to resolve any technical issues as quickly as possible.

Good luck!

Tuesday, December 1, 2009

IPhOD 2

The sequel has arrived!

IPhOD version 2.0 can now be downloaded from iphod.com. I have improved the documentation on the website and I'm hoping to get version 2.0 search scripts and calculators up before the end of December. Following those final changes, the webpage and database should be stable for some time. If you have a chance, let me know what you think about the new organization of the webpage.


Here are the finer aspects of Version 2:

1. A better frequency measure.
I switched from Kucera-Francis to SUBTLEXus (Brysbaert & New), whose authors have done a marvelous job demonstrating the inadequacies of KF in explaining variance in psychological data. Furthermore, they have presented a good case that counts based on movie subtitles are a far superior option. I decided to include the frequency measure and context dependent measure, too. The impact of changing the frequency measure is that all of the frequency-weighted measures will likely be altered by having fresher and improved written counts.

2. More words.
There are 54k words in version 2.0, up from 33k. This means that all sorts of search terms may come up when you browse its contents. However, it also means that the phonotactic and density calculations in IPhOD now are calculated on an even wider base. Since many of the probability measures are relative frequency (divided by terms summed over all words, phoneme pairs, etc.) - more words should improve the accuracy.

3. Homophones and Homographs.
They're both included in the new database. I treated them differently when counting words: homophones have their frequencies summed but are counted once for raw counts, while homographs are all counted, with their (written) counts added once whenever they came up in counts.

4. Length-Constrained Positional Probability.
I have been contemplating this for a while, so I included a new measure that constrained the phoneme-positional counts to words of equal phoneme length. I'll have to explore this one later in more detail, but the intuition is that LCPP should be less variable across words of different phoneme lengths. Since length constraints group counts together, the effects of later positions could be smaller... However, my first peeks at the data found that those pesky regular CV structures and word-final consonant clusters probably will keep LC from being the final solution to the positional problem.
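To make the distinction concrete, here is a minimal Python sketch contrasting an ordinary positional probability with a length-constrained one. It uses unweighted counts only, and the word list and function name are hypothetical; it illustrates the idea and is not the code used to build the database.

```python
from collections import Counter

# Hypothetical toy list of CMU-style transcriptions.
TOY_WORDS = ["K.AE.T", "B.AE.T", "K.AE.T.S", "AE.T"]

def positional_probs(transcriptions, length=None):
    """Probability of each phoneme at each position (0-based).

    With length=None every word contributes to the counts.
    With length=n only words that are exactly n phonemes long are used,
    which is the length-constrained variant.
    """
    counts = Counter()   # (position, phoneme) -> number of words
    totals = Counter()   # position -> number of words reaching that position
    for transcription in transcriptions:
        phones = transcription.split(".")
        if length is not None and len(phones) != length:
            continue
        for position, phone in enumerate(phones):
            counts[(position, phone)] += 1
            totals[position] += 1
    return {key: n / totals[key[0]] for key, n in counts.items()}

unconstrained = positional_probs(TOY_WORDS)
three_phoneme = positional_probs(TOY_WORDS, length=3)
```

In this toy list, "AE" in the second position has probability 0.75 across all words but 1.0 among three-phoneme words, because the two-phoneme word "AE.T" no longer shifts the vowel into the first position.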

All of my excitement reflects just a few new files in the "Download" section. However, these changes should improve the ability of speech researchers (including me) to control and select words according to phonological factors. It will be terrific to find out if they really do!

Monday, November 23, 2009

the word or pseudoword "perved"

Don't worry! IPhOD has not been hacked again - the title alludes to a comment by an anonymous grad student who participated in one of my experiments recently. 

One item presented during the pseudoword detection task was pronounced perved, as if there were a past tense of the colloquial abbreviation for the noun: pervert. These sorts of pseudowords are always a hot item with subjects, since these sorts of items may evoke any number of representations or processes in the course of making the tough call - did I just hear a word or a pseudoword? When you are using pseudowords that are generated to highly resemble English words, it is inevitable that subjects will hear slang or potential slang.

A brief debate followed the experiment. I argued with the participant that perved is definitely not an English word, plain and simple - you cannot apply past tense to a noun in this way. More importantly, we are not including pseudoword trials in the analysis - so the effect of such items is to challenge subjects and hopefully keep them engaged in the otherwise boring task.

However, it only took one hour in New York City to be proven wrong. Almost immediately after that conversation, I went out for lunch and along the way, a strange looking man walked by me, making unusual faces and articulating some disturbing noises. I sped up my pace and got out of there! Based on this incident though, it seems that you can actually be perved by someone. While the pseudoword in question seems meaningless in Southern California, perved appears to be a legitimate, meaningful word in the Big Apple!

It is all about context, so choose your pseudowords carefully! Hopefully this guy wasn't my next subject...

Tuesday, November 17, 2009

hacked!

If you've visited IPhOD in the last 2 weeks or so, then you probably noticed that it looks pretty strange and bad. The company that hosts my webpage informed me by email that someone used a SQL hack to break into the webpage and trash it. The beautiful calculator functions that I wrote allowed the hacker to feed some code into the SQL server and then take over my site. This was terrible news since I had put so much time into the webpage and I am currently preparing a writeup. On the other hand ... this seemed like a good time to debut IPhOD version 2.0.

Since the new database is completely prepared and I must rebuild the webpage anyway, it makes sense to go ahead and distribute the newest version. The principal difference between version 1.4 and 2.0 is that I am using the SUBTLEX database word frequencies instead of Kucera-Francis word frequencies (the latter is widely seen as an inferior frequency measure). Another important change is the inclusion of homographs and homophones, and data columns that will reveal whether an entry has more than one pronunciation or spelling. I will be blogging a separate post on the changes when I release the newest version, so stay tuned!

IPhOD will continue to be available as a download with Perl scripts that I wrote to search its contents and calculate new values. I will try to get the calculators operating again, but I have to patch up the vulnerability that the host identified. Since I am a programmer but not a computer scientist, this may take a while.

Your Opinion?

Here is where I ask for your opinions and tips: Is there anything that you think I should organize differently? From users: any new features that you think would be more valuable for searching the database? If you are a hacker or web programmer, is there a simple way to prevent SQL attacks from user-submitted forms? Any general comments are appreciated too!
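On that last question, one widely recommended starting point is to use parameterized queries, so that user-submitted text is always passed to the database as data and never pasted into the SQL string itself. The sketch below shows the idea with Python's built-in sqlite3 module. It is only an illustration under stated assumptions: the table, column, and function names are hypothetical, and none of it is taken from the IPhOD site.

```python
import sqlite3

def build_demo_db():
    """Create a small in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE words (spelling TEXT, transcription TEXT)")
    conn.executemany(
        "INSERT INTO words VALUES (?, ?)",
        [("cat", "K.AE.T"), ("bat", "B.AE.T")],
    )
    return conn

def lookup(conn, user_input):
    """Look up spellings for a user-submitted transcription.

    The '?' placeholder makes the driver bind user_input as a value,
    so quotes or SQL keywords inside it are not executed as SQL.
    """
    cursor = conn.execute(
        "SELECT spelling FROM words WHERE transcription = ?",
        (user_input,),
    )
    return [row[0] for row in cursor.fetchall()]

conn = build_demo_db()
normal_result = lookup(conn, "K.AE.T")
attack_result = lookup(conn, "x' OR '1'='1")
```

Here the normal lookup finds "cat", while the injection-style input is compared literally against the transcription column and matches nothing. Most database libraries offer the same kind of placeholder mechanism; validating form input (for example, accepting only known phoneme symbols and periods) is a sensible second layer.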

I await your comments!