Monday, August 8, 2011

Review: Tulloss (1997) Assessment of Similarity Indices for undesirable properties and a new Tripartite Similarity Index based on cost functions, pp.122-143, In: Palm, Chapela (Eds.) Mycology in Sustainable Development: Expanding concepts, vanishing borders. Parkway Publishers, Boone NC USA.

Feature Paper: Tulloss (1997) Assessment of Similarity Indices for undesirable properties and a new Tripartite Similarity Index based on cost functions, pp.122-143, In: Palm, Chapela (Eds.) Mycology in Sustainable Development: Expanding concepts, vanishing borders. Parkway Publishers, Boone NC USA.
Author Abstract: Comparison of lists is a common element of many studies including ethnomycological, ecological, and mycological investigations. The items on the lists might be species in a habitat, uses of a given organism by indigenous people, character states present in an individual fungus, or lists of unusual spellings in segments of the Dead Sea Scrolls. Often, it is desirable to express the similarity of two related lists by some formula (a similarity index). Such an index might be used in summarizing data otherwise presented or as input to further numerical processing, such as the creation of a dendrogram (Pankhurst, 1991:54). In examining several works using formulae to provide a single number expressing the similarity of the contents of two lists, a number of difficulties with the formulae were noted. For example, for some indices the same value was generated for two or more quite different situations, e.g., one in which a pair of lists were nearly identical, and another in which one list was much larger than the other. This problem came up during review of material for the present book, thus motivating the present chapter. The purpose of this chapter is to motivate, describe, and offer an implementation for, a working similarity index that avoids the difficulties noted for the others.
Note to Readers: Passages in quotation marks in my review are the author's own words.
Review: Today we'll finish our review of taxonomic similarity indices with the "Tripartite Similarity Index." The index borrows mathematical formulas from outside the biological sciences and applies them to biology, an increasingly common approach as scientists across disciplines communicate more with one another.
The author of today's paper developed a new similarity index because he was dissatisfied with existing indices for a number of reasons (summarized in his abstract above). In his comparison, he reviewed 20 "existing and commonly used similarity indices" and determined that "no problem-free index was found in the list." However, by reviewing cost function metrics from "manufacturing engineering," he was able to create an index that solved the problems of all the reviewed indices.
I won't go through all 20 similarity indices here. I highly recommend the original paper for a thorough explanation of the pros and cons of each index, and for the considerations every researcher should weigh before comparing lists for similarity.
For convenience, here are the 10 indices the author examined and critiqued at greater length. I leave it to the reader to consult the full paper for the limitations of each:
  1. Simpson Coefficient
  2. Second Kulczynski Coefficient
  3. Ochiai / Otsuka Coefficient
  4. Dice Coefficient
  5. Jaccard Coefficient
  6. Sokal and Sneath Coefficient
  7. First Kulczynski Coefficient
  8. Mountford Coefficient
  9. Braun-Blanquet Coefficient
  10. Fager and McGowan Coefficient
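To make the comparison concrete, here is a minimal Python sketch of six of the indices above. It uses their commonly cited textbook forms rather than anything copied from the paper, and it assumes the usual notation in which a is the number of entries shared by both lists, b the number found only in the first list, and c the number found only in the second. The function name `classic_indices` is my own.

```python
import math


def classic_indices(a, b, c):
    """Return several classic similarity indices for two lists.

    Assumed notation (not taken from the post itself):
      a = entries shared by both lists
      b = entries only in the first list
      c = entries only in the second list
    """
    if a + b <= 0 or a + c <= 0:
        raise ValueError("each list must contain at least one entry")
    return {
        "jaccard": a / (a + b + c),
        "dice": 2 * a / (2 * a + b + c),
        "simpson": a / (a + min(b, c)),          # shared / smaller list
        "braun_blanquet": a / (a + max(b, c)),   # shared / larger list
        "ochiai": a / math.sqrt((a + b) * (a + c)),
        "kulczynski_2": 0.5 * (a / (a + b) + a / (a + c)),
    }


# Two quite different situations:
identical = classic_indices(10, 0, 0)   # two identical 10-entry lists
subset = classic_indices(10, 0, 90)     # a 10-entry list inside a 100-entry list

for name in identical:
    print(f"{name:15s} identical={identical[name]:.3f} subset={subset[name]:.3f}")
```

With these inputs the Simpson coefficient returns 1.0 both for two identical lists and for a small list wholly contained in one ten times its size, which is the kind of ambiguity the author's abstract describes. The Jaccard coefficient, by contrast, drops to 0.1 in the second case.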
In developing a new similarity index, the author identified 8 requirements that the reviewed indices failed to meet in full (usually only in part, since nearly every index satisfied some of the requirements):
  1. "A similarity index shall be sensitive to the relative size of the two lists to be compared; and great difference in size shall be interpreted to reduce the value of the similarity index."
  2. "A similarity index shall be sensitive to the size of the sublist shared by a pair of lists; and an increase in difference in size between the smaller of the two lists and the sublist of common entries shall be interpreted to reduce the value of the similarity index."
  3. "A similarity index shall be sensitive to the percentage of entries in the larger list that are in common between the lists and to the percentage of entries in the smaller list that are in common between the two lists and shall increase as these two percentages increase."
  4. "A similarity index shall yield values having fixed upper and lower bounds."
  5. "A similarity index shall have the property that when two lists are identical, the similarity index for the two lists shall be equal to the upper bound of the index."
  6. "A similarity index shall have the property that when two lists have no entries in common, the similarity index for the lists shall be equal to the lower bound of the index."
  7. "Distribution of values of a similarity index between zero and one shall be such that (a) if the size of two input lists is fixed, then the output shall vary roughly directly as the number of entries shared between the lists; and (b) if the smaller list is a subset of the larger list, then the value of the similarity index shall vary roughly inversely as the size of the larger list."
  8. "A similarity index program shall check its input data to verify that the following relationships hold: a + b > 0 and a + c > 0."
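Requirement 8 is the easiest one to show in code. The sketch below is only an illustration under stated assumptions: the helper name `list_counts` is hypothetical, lists are treated as sets (so duplicate entries are ignored), and a, b, and c are taken to mean the shared entries, the entries only in the first list, and the entries only in the second list. Under that reading, a + b and a + c are the sizes of the two lists, so the check simply rejects an empty list.

```python
def list_counts(first, second):
    """Return (a, b, c) for two lists and validate them per requirement 8.

    Assumed meaning of the counts:
      a = entries in both lists
      b = entries only in `first`
      c = entries only in `second`
    """
    x, y = set(first), set(second)
    a = len(x & y)
    b = len(x - y)
    c = len(y - x)
    # Requirement 8: a + b > 0 and a + c > 0 (neither list may be empty).
    if a + b <= 0 or a + c <= 0:
        raise ValueError("both lists must contain at least one entry")
    return a, b, c


# Made-up species lists for two sites, purely for illustration.
site_1 = ["sp. A", "sp. B", "sp. C"]
site_2 = ["sp. B", "sp. C", "sp. D", "sp. E"]
print(list_counts(site_1, site_2))
```

Running the check before any index is computed avoids division by zero in formulas that divide by the size of a list.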
To satisfy all 8 requirements listed above, the author arrived at the following formula, composed of 3 individual components (hence the name Tripartite Similarity Index):
T = √(U × S × R), where the following subformulae apply:
[Figure: subformulae defining U, S, and R; see the original paper]
The author cautions, however, that while the Tripartite Similarity Index does satisfy all requirements other authors have described for similarity indices, the T-values it produces are somewhat abstract and can't be read as simple percentages of similarity. Instead, the author notes that "Our primary hope is that our intuitions about a loosely defined property of points in the three dimensional space (similarity) is reflected in the position of a corresponding point on a line."
So now that you have a new formula for comparing lists of data, have fun computing some T-values and compare how the resulting plots look next to those from more commonly taught indices, such as the Simpson, Jaccard, and Braun-Blanquet similarity indices.
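For anyone who wants to try this, here is a rough Python sketch of T alongside the Jaccard coefficient. A loud caveat: the subformulae are not written out in the text of this post, so the definitions of U, S, and R below are an assumption based on the form in which the index is commonly cited, and they should be checked against the original paper before being used for real work. As before, a is assumed to be the number of shared entries, and b and c the entries unique to each list; the function names are my own.

```python
import math


def tripartite_similarity(a, b, c):
    """Tripartite Similarity Index T = sqrt(U * S * R).

    ASSUMPTION: U, S and R follow the commonly cited form of the index;
    verify against Tulloss (1997) before relying on the values.
      a = shared entries, b = only in list 1, c = only in list 2
    """
    if a + b <= 0 or a + c <= 0:
        raise ValueError("both lists must contain at least one entry")
    lo, hi = min(b, c), max(b, c)
    # U: penalizes a large difference in size between the two lists.
    u = math.log2(1 + (lo + a) / (hi + a))
    # S: penalizes a smaller list that is large relative to the shared part.
    s = 1 / math.sqrt(math.log2(2 + lo / (a + 1)))
    # R: rewards a high shared fraction in both lists.
    r = math.log2(1 + a / (a + b)) * math.log2(1 + a / (a + c))
    return math.sqrt(u * s * r)


def jaccard(a, b, c):
    """Jaccard coefficient in the same (a, b, c) notation."""
    return a / (a + b + c)


cases = {
    "identical lists": (10, 0, 0),
    "small list inside big list": (10, 0, 90),
    "half shared, equal sizes": (10, 10, 10),
    "nothing shared": (0, 5, 7),
}
for label, (a, b, c) in cases.items():
    t = tripartite_similarity(a, b, c)
    print(f"{label:28s} T={t:.3f}  Jaccard={jaccard(a, b, c):.3f}")
```

Under these assumed definitions, identical lists give T = 1 and lists with nothing in common give T = 0, which matches requirements 5 and 6 above.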
Next we'll have two papers dealing with various patterns of species diversity, with links to papers below.
