Thursday, May 16, 2019

Good question!

Time for a random IPhOD blog post. Here are four good (recently asked) questions about IPhOD, version 2.0. Feel free to ask more, if you are using the database in your research!


Question 1. Multiple values for the same word?

I searched IPHOD version 2.0 and noticed that it generates different word forms and values for the same input word. For example, here are the results when I searched for the word "artichoke":
Word        NPhon   unsDENS   unsBPAV
artichoke   7       1         0.0030323
artichoke   7       2         0.003574

What's up?!

Answer 1. Multiple pronunciations (you need to show CMU transcriptions)
IPhOD returns more than one entry when the CMU Pronouncing Dictionary contains multiple pronunciations for that word. In addition to the most common pronunciation, I decided to include the alternate pronunciations of the same word. The results shown above (the default view) do not indicate which values belong to which pronunciation. However, if you check the "CMU transcription" box on the search page, then you can see the CMU pronunciations for those two entries:
Word        UnTrn               NPhon   unsDENS   unsBPAV
artichoke   AA.R.T.AH.CH.OW.K   7       1         0.0030323
artichoke   AA.R.T.IH.CH.OW.K   7       2         0.003574

You say ar-TIH-choke, and I say ar-TUH-choke...



Question 2. Which version of the CMU Pronouncing Dictionary did IPhOD come from?

Answer 2. CMUPD version 0.7a (2008).
The last update of IPhOD was in November 2009, and I used the latest version of the CMU Pronouncing Dictionary available at the time (version 0.7a, released in 2008). IPhOD (2.0) contains only words that have both CMU transcriptions and SUBTLEXus word frequencies; the resulting word list is smaller because many words in each database do not appear in the other. Information on how homophones, homographs, and multiple pronunciations were handled is given in the background information here: www.iphod.com/details



Question 3. How can you tell which nonwords have a high or low value?

Answer 3. High and low values are relative to a baseline that should be thoughtfully determined.

In the simplest sense, relatively high and low probability nonwords are just two groups whose values do not overlap: every item in the high group has a larger value than every item in the low group. The easiest approach to constructing such a stimulus list is to form two samples with non-overlapping ranges of probabilities. This approach has limitations, though, since all of the items could still have low probabilities relative to the larger distribution of all English words.

Probably a better strategy is to compare a sample of interest to the larger population of IPhOD words, since the values are calculated uniformly for all items. Start by downloading the IPhOD words text file and using Excel or some other program to establish the mean and standard deviation (SD) for your value of interest. For the low probability group, you could arbitrarily set an upper limit of Mean - 0.5 SD and only select nonwords with values below that threshold. A lower limit of Mean + 0.5 SD would then be set for the high probability group; likewise, only select nonwords with values above it. You might consider medians instead of means if the distribution of values is highly skewed. It might also be important to restrict the reference distribution for your calculations to IPhOD words with a similar CV structure or number of syllables/phonemes to your nonwords. Restricting the basis set of words can be complicated, and it is also possible to limit a reference distribution too much through these restrictions.
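
If you prefer scripting over a spreadsheet, here is a minimal Python sketch of that thresholding idea. It is not an official IPhOD tool, and the file name, delimiter, and column name (unsBPAV) are assumptions, so adjust them to whatever your downloaded copy uses.

```python
import statistics

# Minimal sketch of the Mean +/- 0.5 SD selection described above.
# Assumes a tab-delimited IPhOD download with a header row; the file name
# and column name are placeholders -- adjust them to match your copy.

def load_column(path, column):
    """Read one numeric column from a tab-delimited IPhOD text file."""
    with open(path) as f:
        header = f.readline().rstrip("\n").split("\t")
        idx = header.index(column)
        return [float(line.rstrip("\n").split("\t")[idx]) for line in f if line.strip()]

values = load_column("IPhOD2_Words.txt", "unsBPAV")   # hypothetical file name
mean, sd = statistics.mean(values), statistics.stdev(values)

low_cutoff = mean - 0.5 * sd    # upper limit for the low-probability group
high_cutoff = mean + 0.5 * sd   # lower limit for the high-probability group

def classify(value):
    """Assign a candidate nonword's value to a group, or reject it."""
    if value < low_cutoff:
        return "low"
    if value > high_cutoff:
        return "high"
    return None  # falls in the middle band; exclude from both groups
```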

Another important consideration is that these values are inter-correlated. For example, a word consisting of high probability phoneme pairs is likely to also come from a dense neighborhood. Words with CVC structures will have a different distribution of densities and phonotactic probabilities than words with CVCC structures. Sometimes you have to consider several variables at once, and hopefully the IPhOD tools online are sufficient for that.

Depending on your experience with Excel and/or statistics, it may be a good idea to have someone with statistical training assist in the selection of words and nonwords.



Question 4. Where did these neighbor counts come from?

Answer 4. You can check the counts by viewing a list of neighbors.
The IPhOD (2.0) calculator has a "show neighbors" button that lets you see all of the neighbors included in the counts or weighted sums. So if you want to calculate things differently and don't want to start from scratch, you can produce your own lists.

Wednesday, March 28, 2012

New IPhOD v2.0 Calculator

How do you obtain the phonotactic frequency or neighborhood density of a word that is not included in the latest version of IPhOD, or obtain a list of its phonological neighbors with homophones and homographs labelled?

Try the new IPhOD version 2.0 online calculator! The added feature calculates word-averaged phonotactic probabilities, neighborhood densities, and positional probabilities based on user-entered CMU transcriptions (Weide, 1994). This addresses an earlier limitation of IPhOD version 2.0: you could only search among words included in the ~50,000 word collection (or a preset list of pseudowords). Prior to this update, if you needed to calculate values for new word or pseudoword transcriptions, you had to use values from version 1.4, which used different frequency weightings and a smaller word collection than 2.0.

The new calculator based on the IPhOD sequel also includes more options. The IPhOD version 2.0 calculator webpage (navigate to Calculator, then Version 2.0, or click here) allows the user to enter a list of CMU transcriptions (e.g. "K.AE.T" for "cat"), which can be used to generate any of the following values that the user selects:

1-4. Neighborhood density,
5-8. Biphoneme probability,
9-12. Triphoneme probability,
13-15. Positional probability,
16-20. Length-constrained probability.

Each of those selections comes with four options for weighted calculations: unweighted raw counts, SUBTLEXus frequency-weighted (S Freq), log10 S Freq, and S Context-Dependent Counts. The word frequency weights are from SUBTLEXus (Brysbaert & New, 2009). At this time, I do not have plans to include syllable stress constraints in the calculator.
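
For readers who like to see the arithmetic, here is a rough Python sketch of a word-averaged biphone probability in the spirit of the unweighted (raw count) option. It is not the IPhOD source code, and the three-word toy corpus is purely illustrative; a frequency-weighted variant would simply multiply each count increment by the word's SUBTLEXus frequency.

```python
from collections import Counter

# Toy corpus: word -> CMU transcription (dot-separated). Placeholder data only.
corpus = {
    "cat": "K.AE.T",
    "cab": "K.AE.B",
    "at":  "AE.T",
}

def biphones(transcription):
    phones = transcription.split(".")
    return list(zip(phones, phones[1:]))

# Count every biphone over the whole corpus. (For a frequency-weighted
# version, multiply each increment by the word's frequency instead of 1.)
counts = Counter()
for trn in corpus.values():
    counts.update(biphones(trn))
total = sum(counts.values())

def mean_biphone_probability(transcription):
    """Average relative frequency of the biphones in one transcription."""
    pairs = biphones(transcription)
    if not pairs:
        return 0.0
    return sum(counts[p] / total for p in pairs) / len(pairs)

print(mean_biphone_probability("K.AE.T"))   # 0.4 on this toy corpus
```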


Finally, if you click the "Show neighbors?" button before running the calculator, then the calculator produces a cool-looking second table that appears below the traditional calculator output.

The table contains every 1-phoneme-different phonological neighbor, with asterisks denoting words that have homographs or homophones. Although the calculator adjusts to avoid double-weighting words whose spellings are identical, and only counts homophone neighbors as one neighbor (since they are indistinguishable by sound), it may be helpful for users to have that information available.
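
As a rough illustration of the neighbor criterion, the sketch below applies the usual one-phoneme-different rule (a single substitution, insertion, or deletion) to a toy lexicon. It is not the IPhOD implementation, and it skips the homophone/homograph bookkeeping described above, which would require collapsing identical transcriptions first.

```python
def is_neighbor(phones_a, phones_b):
    """True if the two phoneme lists differ by exactly one substitution,
    insertion, or deletion (the usual one-phoneme-different criterion)."""
    if phones_a == phones_b:
        return False
    la, lb = len(phones_a), len(phones_b)
    if la == lb:                                   # one substitution
        return sum(a != b for a, b in zip(phones_a, phones_b)) == 1
    if abs(la - lb) != 1:
        return False
    shorter, longer = sorted((phones_a, phones_b), key=len)
    for i in range(len(longer)):                   # one insertion/deletion
        if longer[:i] + longer[i + 1:] == shorter:
            return True
    return False

# Example: list the neighbors of "cat" in a toy lexicon of transcriptions.
lexicon = {"cat": "K.AE.T", "cab": "K.AE.B", "at": "AE.T", "scat": "S.K.AE.T"}
target = "K.AE.T".split(".")
neighbors = [w for w, trn in lexicon.items() if is_neighbor(target, trn.split("."))]
print(neighbors)   # ['cab', 'at', 'scat']
```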

Pretty neat, huh?

If you start fiddling with the new features and either break something, have questions, or discover a disabled button, please contact me. I have done my best to quality-check the new online calculator, primarily by comparing output from the online search and the online calculator (v2.0).

Also, feel free to send me your suggestions for future extensions. The show neighbors function was a product of this blog, which has been quite an interesting and helpful development for me too.

I hope that these new features will be a helpful resource for other researchers!

Brysbaert, M. and New, B. 2009. Moving beyond Kucera and Francis: A critical evaluation of current word frequency norms and the introduction of a new and improved word frequency measure for American English. Behavior Research Methods, 41, 977-990.

Weide, Robert L. 1994. CMU Pronouncing Dictionary. http://www.speech.cs.cmu.edu/cgi-bin/cmudict.

Saturday, February 18, 2012

Word Associations Around the World

Science needs your help! It only takes 5 minutes, so you don't even have to log off Facebook!

I heard about an ongoing study that needs English speakers to briefly volunteer for a language experiment on the Internet: http://www.smallworldofwords.com/tests. The word association task only takes 5 minutes and involves naming words that are related to a few words that you are presented with.

The goal of the webpage by Dr. Gert Storms and Dr. Simon De Deyne is to collect word association data from thousands of English-speaking adults around the world in order to help explain word meaning and semantic relationships among words. The data collected from this Internet experiment will also provide a resource for cognitive scientists and other language researchers.

Saturday, November 19, 2011

English non-word, German word



This is a picture from a 2008 trip to Switzerland, where an ad meaninglessly (to me) stated: "bleib cool, mann". These were all over the place, which is amusing when you don't know exactly what is being advertised.

Since then, I have accepted the straightforward Internet translation of "stay cool, man" (or "man that stayed cool"?). Either way, one of my favorite IPhOD pseudowords ("BLEEB") is quite similar to a high-frequency German word. English non-words in your experiments can appear as meaningful words elsewhere in the world (or in German classes). This may be an unavoidable issue in using pseudowords in research, but there are things that an experimenter can do to identify items such as these. Get to know your participants, ask about language background, and take careful notes. (e.g., how many of your English-speaking students who apparently had superior pseudoword recall were taking language classes?)

Monday, January 10, 2011

Searching IPhOD 2.0 Online

As of January 11, 2011, the newest release of IPhOD (version 2.0) can be searched online. Most users I've communicated with prefer the online search to offline manipulation of text files, so this should encourage researchers to begin making the switch from version 1.4. The search webpages and utilities were modified minimally, so if you've already learned to search the database, then the newest version of IPhOD is a snap to explore.

Online searches performed on IPhOD prior to January 11, 2011 used version 1.4. I preserved the search utilities for the previous release (1.4), so if you're halfway finished with your experimental stimuli - have no worries! In the future, I would suggest that users note which version (2.0, 1.4, etc.) was used in their research.

As introduced previously, the latest release (2.0) contains several new measures, including a word frequency metric that is more reliable than the one used in 1.4. There is a larger number of real words, and I included homographs and homophones in this version. Although IPhOD 2.0 expanded both the options and the word list, many calculations show a global similarity (e.g. when considering the 30K+ shared words) in terms of the measured associations between old and new calculations. The raw count based calculations are highly similar between versions, while the frequency-weighted values deviate somewhat - as would be expected based on improved word counts. Interestingly, there are strong associations between the log KF (v1.4) and log10 SF (v2.0) weighted values, indicating that the normalizing effect of log-based transformations might obliterate ~50 years of change in word usage, within limits.

One drawback to using IPhOD 2.0: the online calculator tool currently only uses base values from IPhOD 1.4. If you cannot find a word or pseudoword while searching IPhOD 2.0, then you cannot simply enter its transcription to obtain version 2.0 value estimates. While that limits your options, version 2.0 contains over 50,000 entries, so my priority was to make the online search tool available first. The next project is to adapt my calculator code to generate version 2.0 based values.

My hope is that these expanded search tools will aid speech research and applications. As always, I welcome your feedback. I'll try to resolve any technical issues as quickly as possible.

Good luck!

Tuesday, December 1, 2009

IPhOD 2

The sequel has arrived!

IPhOD version 2.0 can now be downloaded from iphod.com. I have improved the documentation on the website and I'm hoping to get version 2.0 search scripts and calculators up before the end of December. Following those final changes, the webpage and database should be stable for some time. If you have a chance, let me know what you think about the new organization of the webpage.


Here are the finer aspects of Version 2:

1. A better frequency measure.
I switched from Kucera-Francis to SUBTLEXus (Brysbaert & New), as the latter have done a marvelous job demonstrating the inadequacies of KF in explaining variance in psychological data. Furthermore, they have presented a good case that counts based on movie subtitles are a far superior option. I decided to include both the word frequency measure and the context-dependent measure. Because the frequency measure changed, all of the frequency-weighted measures will likely shift, now that they rest on fresher and improved counts.

2. More words.
There are 54k words in version 2.0, up from 33k. This means that all sorts of search terms may come up when you browse its contents. It also means that the phonotactic and density calculations in IPhOD are now computed over an even wider base. Since many of the probability measures are relative frequencies (counts divided by terms summed over all words, phoneme pairs, etc.), more words should improve their accuracy.
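
To make the relative frequency point concrete, the unweighted probability of a phoneme pair has roughly this form (my notation, not the official IPhOD documentation), so a larger word base refines both the numerator and the denominator:

```latex
% Illustrative notation only, not taken from the IPhOD documentation:
% relative frequency of the biphone (x, y), counted over all words w.
P(x, y) = \frac{\sum_{w} \mathrm{count}_w(x, y)}
               {\sum_{w} \sum_{(x', y')} \mathrm{count}_w(x', y')}
```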

3. Homophones and Homographs.
Both are included in the new database. I treated them differently when counting words: the frequencies of homophones were summed, but each set was counted only once in the raw counts; homographs were all counted, but their (written) frequency counts were added only once whenever they came up in the counts.

4. Length-Constrained Positional Probability.
I had been contemplating this for a while, so I included a new measure that constrains the phoneme-positional counts to words of equal phoneme length. I'll have to explore this one in more detail later, but the intuition is that LCPP should be less variable across words of different phoneme lengths. Since length constraints group counts together, the effects of later positions could be smaller... However, my first peeks at the data suggest that those pesky regular CV structures and word-final consonant clusters will probably keep LCPP from being the final solution to the positional problem.
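
My reading of the intuition, in rough notation (the actual IPhOD weighting options may differ): the length-constrained probability of phoneme p at position i is estimated only from words with the same number of phonemes n.

```latex
% Sketch of the idea only; not the official IPhOD definition.
P_{\mathrm{LC}}(p, i \mid n) =
  \frac{\#\{\, w : |w| = n,\ w_i = p \,\}}{\#\{\, w : |w| = n \,\}}
```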

All of my excitement reflects just a few new files in the "Download" section. However, these changes should improve the ability of speech researchers (including me) to control and select words according to phonological factors. It will be terrific to find out whether they really do!

Monday, November 23, 2009

the word or pseudoword "perved"

Don't worry! IPhOD has not been hacked again - the title alludes to a comment by an anonymous grad student who participated in one of my experiments recently. 

One item presented during the pseudoword detection task was pronounced perved, as if it were the past tense of the colloquial abbreviation of the noun pervert. These sorts of pseudowords are always a hot item with subjects, since such items may evoke any number of representations or processes in the course of making the tough call - did I just hear a word or a pseudoword? When you are using pseudowords that are generated to closely resemble English words, it is inevitable that subjects will hear slang or potential slang.

A brief debate followed the experiment. I argued with the participant that perved is definitely not an English word, plain and simple - you cannot apply past tense to a noun in this way. More importantly, we are not including pseudoword trials in the analysis - so the effect of such items is to challenge subjects and hopefully keep them engaged in the otherwise boring task.

However, it only took one hour in New York City to be proven wrong. Almost immediately after that conversation, I went out for lunch, and along the way a strange-looking man walked by me, making unusual faces and articulating some disturbing noises. I sped up my pace and got out of there! Based on this incident, though, it seems that you can actually be perved by someone. While the pseudoword in question seems meaningless in Southern California, perved appears to be a legitimate, meaningful word in the Big Apple!

It is all about context, so choose your pseudowords carefully! Hopefully this guy wasn't my next subject...