Talk:Tf–idf
From Wikipedia, the free encyclopedia
Usually, the term frequency is just the count of a term in a document (NOT divided by the total number of terms in the document), which is confusing because it isn't really a frequency.
I strongly agree. In all the technical papers I've been reading for my Internet services class at U.Washington, TF is the raw count, so TF*IDF is biased toward longer documents (it usually takes higher values for them) and therefore needs to be normalized.
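To make the length bias concrete, here is a minimal sketch in Python. The function names `raw_tf` and `length_normalized_tf` and the toy documents are made up for illustration and do not come from any particular paper or library. It compares the raw count with the count divided by document length for a short document and for the same text repeated ten times.

```python
from collections import Counter


def raw_tf(tokens):
    """Term frequency as a plain count of each term (hypothetical helper)."""
    return Counter(tokens)


def length_normalized_tf(tokens):
    """Term frequency as count divided by the total number of terms."""
    counts = Counter(tokens)
    total = len(tokens)
    return {term: count / total for term, count in counts.items()}


# Toy data (an assumption for illustration): the long document is just
# the short one repeated ten times, so its content is identical.
short_doc = "cat sat mat".split()
long_doc = short_doc * 10

# Raw counts grow with document length: 1 versus 10.
print(raw_tf(short_doc)["cat"], raw_tf(long_doc)["cat"])

# Length-normalized values are the same for both documents (one third).
print(length_normalized_tf(short_doc)["cat"],
      length_normalized_tf(long_doc)["cat"])
```

With the raw count, the longer document scores ten times higher for "cat" even though it says nothing new, which is the bias described above. Dividing by document length is only one of several possible normalizations.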
lowercase
Why is the title of the article in lower case? Why not "TF-IDF"? --ajvol 15:29, 25 November 2006 (UTC)
- I believe the short story is that tf-idf is a well-known function in the literature, and that is how it is referred to there. I know that in some cases the lowercase form is used to help differentiate it from the uppercase variations that are sometimes used to refer to other equations. Josh Froelich 03:19, 11 December 2006 (UTC)
- In other papers I see a multiplication sign (TF*IDF), not a minus sign. See, e.g., S. Robertson, "Understanding inverse document frequency: on theoretical arguments for IDF", Journal of Documentation 60, 503-520, 2004.
What do you think about renaming the article? --AKA MBG (talk) 14:18, 7 March 2008 (UTC)
Example
As an example, you could extract the most relevant terms from a revision of a Wikipedia page, perhaps this very one. --84.20.17.84 15:20, 15 March 2007 (UTC)
Text Data Clustering
- I think we can also use tf-idf in text data clustering. Does anyone know of any Java source code for clustering unstructured text data based on tf-idf? —Preceding unsigned comment added by 125.53.215.245 (talk) 03:09, 12 September 2007 (UTC)
Normalized frequencies
The term frequency isn't usually normalized by dividing it by the total length of each document. Instead, normalization is done by dividing by the frequency of the most used term in the document (as outlined in http://www.miislita.com/term-vector/term-vector-4.html). —Preceding unsigned comment added by 151.53.133.126 (talk) 18:59, 29 February 2008 (UTC)
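A minimal sketch of the maximum-frequency normalization described above, in Python. The function name `max_normalized_tf` and the sample sentence are made up for illustration. Each count is divided by the count of the most frequent term in the same document, so the most used term always gets the value 1.

```python
from collections import Counter


def max_normalized_tf(tokens):
    """Divide each term count by the count of the document's most frequent term.

    Hypothetical helper for illustration; returns an empty dict for an
    empty document to avoid calling max() on an empty sequence.
    """
    counts = Counter(tokens)
    if not counts:
        return {}
    max_count = max(counts.values())
    return {term: count / max_count for term, count in counts.items()}


# Toy document (an assumption): "the" occurs 3 times, "and" twice,
# and every other term once.
doc = "the cat and the dog and the bird".split()
tf = max_normalized_tf(doc)

for term, value in sorted(tf.items(), key=lambda item: -item[1]):
    print(f"{term}: {value:.3f}")
```

Here "the" gets 1.0, "and" gets 2/3 and the remaining terms get 1/3, whatever the length of the document.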
Logarithms
Can someone please specify the logarithm bases correctly? Is it a binary (base-2) or a base-10 log? —Preceding unsigned comment added by Godji (talk • contribs) 12:03, 20 March 2008 (UTC)
- It doesn't matter, as long as the base is the same in all your calculations. 24.222.83.249 (talk) 23:42, 1 June 2008 (UTC)
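A quick check of the point about bases, sketched in Python under the assumption that idf takes the plain form log(N / df), with N the number of documents and df the number of documents containing the term. The helper `idf` and the document frequencies below are invented for illustration. Changing the base multiplies every idf value by the same constant, so the ordering of terms does not change.

```python
import math


def idf(num_docs, doc_freq, base):
    """Inverse document frequency, assuming the plain form log(N / df)."""
    return math.log(num_docs / doc_freq, base)


# Invented corpus statistics, for illustration only.
num_docs = 1000
doc_freqs = {"the": 1000, "retrieval": 50, "tfidf": 2}

idf2 = {term: idf(num_docs, df, 2) for term, df in doc_freqs.items()}
idf10 = {term: idf(num_docs, df, 10) for term, df in doc_freqs.items()}

# Every nonzero idf differs between the two bases by the same factor, log2(10).
ratios = [idf2[t] / idf10[t] for t in doc_freqs if idf10[t] != 0]
print(ratios, math.log2(10))

# The ranking of terms by idf is therefore identical in both bases.
rank2 = sorted(doc_freqs, key=idf2.get, reverse=True)
rank10 = sorted(doc_freqs, key=idf10.get, reverse=True)
print(rank2 == rank10)
```

This only covers the plain log(N / df) form. Variants that add a constant to the logarithm are not simply rescaled by a change of base, so for those the choice can matter.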

