Talk:Information retrieval
== Precision and Recall: Separate page ==
Searching for precision/recall, I was surprised to find them "buried" here in the IR page and not described in a separate wiki article (as they are in German). Precision and recall are widely used in different fields of computer science, not only IR. Therefore, I have created a precision/recall page, Precision and Recall, mostly adapted from the German page. Tobi Kellner (talk) 07:30, 21 November 2007 (UTC) PLEASE check and correct my Precision and Recall article!
== Cleaning ==
Rather than clean those two paragraphs up again, I chose to revert to Mikkalai's version, which retained the corrections. If the missing paragraph is re-inserted, please make sure that the grammatical corrections are not overwritten. The two paragraphs I had cleaned up looked like they were written in another language and run through a translation program. They used English words and successfully communicated a concept, but were horrible from a readability standpoint. Please do not overwrite corrections that do not affect the substance of the material. SWAdair | Talk 06:15, 13 May 2004 (UTC)
I am planning on removing the lists of open source and other IR tools, perhaps incorporating some into the list of search engines. There is a downward spiral of spam being introduced into the article. WP:NOT a directory or a place for commercial links. While I understand some may have good intentions, we can't keep some links and hide others without running into all sorts of issues. I might do this soon given the recent edits to this page. Please let me know of any objections. Josh Froelich 03:29, 7 January 2007 (UTC)
== Precision and Recall ==
- Current or tide:
  P = (number of relevant documents retrieved) / (number of retrieved documents)
  R = (number of retrieved documents) / (number of relevant documents)
- Correct or tidy:
  P = (number of relevant documents retrieved) / (number of documents retrieved)
  R = (number of relevant documents retrieved) / (number of relevant documents stored)
Hopefully yours, --KYPark 01:15, 3 Jun 2005 (UTC)
- Thank you. I've implemented the corrections you suggested. In general: If you feel a change is needed, feel free to make it yourself! Wikipedia is a wiki, so anyone (yourself included) can edit any article by following the Edit this page link. You don't even need to log in, although there are several reasons why you might want to. Wikipedia convention is to be bold and not be afraid of making mistakes. If you're not sure how editing works, have a look at How to edit a page, or try out the Sandbox to test your editing skills. New contributors are always welcome. --MarkSweep 03:40, 3 Jun 2005 (UTC)
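As a concreteness check on the corrected definitions above, here is a minimal Python sketch; the function and set names are illustrative, not taken from the article:
<source lang="python">
def precision_recall(retrieved, relevant):
    """Compute precision and recall from two sets of document IDs.

    retrieved: documents returned by the system.
    relevant:  documents judged relevant (the stored relevant documents).
    """
    relevant_retrieved = retrieved & relevant  # relevant documents that were retrieved
    precision = len(relevant_retrieved) / len(retrieved) if retrieved else 0.0
    recall = len(relevant_retrieved) / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 3 of the 4 retrieved documents are relevant, out of 6 relevant documents stored.
p, r = precision_recall({1, 2, 3, 4}, {1, 2, 3, 5, 6, 7})
print(p, r)  # 0.75 0.5
</source>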
== Major figures in information retrieval ==
I wonder if having a subjective list of "major figures" is really a good idea... Sure, there are some recognizable people in the field, but who decides who goes on the list and who doesn't? I have my own list of who I think are "major figures", and I'm sure others might have a completely disjoint list. Just seems too subjective to me. --.msbmsb 19:21, 17 October 2005 (UTC)
== F Measure ==
I've changed the formula for the F measure so that it uses the product of β² and P, rather than the product of β² and R, in the denominator. This brings the formula into line with that used by van Rijsbergen (as referenced in the article), and is consistent with the descriptions of F0.5 and F2 given in the article.
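For reference, the standard van Rijsbergen-style F-measure this describes (stated here from the textbook definition, not quoted from the article) is:
<math>F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}</math>
With β = 0.5 the measure weights precision more heavily, with β = 2 it weights recall more heavily, and β = 1 reduces to the harmonic mean of P and R.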
== Evaluation of machine translation ==
Precision/Recall are often used in the automatic evaluation of machine translation, indeed in a lot of NLP evaluation. - 88.96.32.193 13:48, 4 May 2006 (UTC)
== Terminology according to ISO ==
I have talked with many people trained in science and engineering who are initially very confused by the IR terms "precision" and "recall". The confusion is caused by the incompatibility between IR "precision" and the other technical meanings of precision. Only when the terms are defined in the context of IR do we realize that "precision" and "recall" map to relevancy (a form of accuracy) and sensitivity (as in tests), respectively. In an encyclopedia, it would be considerate to make this clear early and often in discussions using IR's "precision" and "recall".
Please look at the following excerpt from Talk:Accuracy_and_precision#Terminology according to ISO.
The International Organization for Standardization (ISO) provides the following definitions.
- Accuracy: The closeness of agreement between a test result and the accepted reference value.
- Trueness: The closeness of agreement between the average value obtained from a large series of test results and an accepted reference value.
- Precision: The closeness of agreement between independent test results obtained under stipulated conditions.
Reference: International Organization for Standardization. ISO 5725. Accuracy (trueness and precision) of measurement methods and results. Geneva: ISO, 1994.
Clark Mobarry 18:03, 12 May 2006 (UTC)
- Sure, the distinction should be made clear, but I don't think it needs to be phrased as '"beware", this definition "conflicts" with others'... .msbmsb 18:52, 12 May 2006 (UTC)
== History of IR ==
How about a (short) section discussing the history and development of IR? Stuff like the initial impetus for IR (census information, etc.), IR establishing itself as its own field, changing descriptions of IR over time, IR and the WWW...
- It's all about the memex. 74.220.78.49 09:46, 19 September 2007 (UTC)
== Break IR down into subfields ==
One thing I missed when I read this article was that information retrieval was not broken down into smaller subfields. It might be helpful to break this field down when developing the article further.
Nybbles 10:29, 17 November 2006 (UTC)
== References would be nice ==
For those of us wanting to cite something other than this article, it would be a nice starting point to give, say, the source of each equation.
65.93.206.3 07:43, 20 December 2006 (UTC)
== Confusion table ==
I think the whole article would be much easier to follow in terms of true positives, etc. instead of retrieved documents, relevant documents, etc. --Ben T/C 15:40, 21 May 2007 (UTC)
- Absolutely not. These are standard terms. 74.220.78.49 09:55, 19 September 2007 (UTC)
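For reference, the standard mapping between the two vocabularies (a textbook identity, not taken from the article) is:
<math>P = \frac{TP}{TP+FP}, \qquad R = \frac{TP}{TP+FN}</math>
where TP counts relevant documents retrieved, FP nonrelevant documents retrieved, and FN relevant documents not retrieved.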
== Glimpse / Webglimpse ==
They are listed among the open-source IR systems, but according to their respective websites their licenses do not seem to be open source anymore. I don't know whether they were open source in the past, so I didn't remove them. Maybe someone with more knowledge of the subject could take care of the issue? --Lastrainson 09:57, 5 September 2007 (UTC)
== MAP ==
The following comments were posted on the article page:
Shouldn't the denominator just be N instead of the number of relevant documents? It seems natural to me that if you're summing N occurrences of P, you would then divide by N. This change would make all the problems mentioned below go away.
Please confirm/delete/edit the following.
One version of MAP I've seen is referred to as MAP @ N (where N is an arbitrary retrieval cut-off, typically 5, 10, 20, etc.).
In this case, the formula does not seem to be correct. For example, consider a ranking where the first 4 documents are relevant and the 5th is non-relevant. Then AP @ 1 = 1, @ 2 = 1, @ 3 = 1, @ 4 = 1, and surprisingly it is still 1 @ 5, since (1 + 1 + 1 + 1 + 0) / 4 = 1. This is clearly wrong: AP should be calculated even when rel(r) is not true. Thus for this simple example it would be (1 + 1 + 1 + 1 + 0.8) / 5. Which brings me to my second question: the denominator "Relevant Documents" is the number of relevant documents retrieved, but for MAP @ N it should actually be the number of retrieved documents (i.e. N). The above definition only seems to work when the LAST retrieved item is relevant.
Final comment: what does one do when |all relevant docs| < N and |all relevant retrieved| == |all relevant docs|? This may happen with some specific test sets. My solution is to stop calculating MAP at the rank of the last relevant document retrieved, and to report that value for all following MAP @ x statistics.
Depending on who owns this page, perhaps some Matlab or R code would be nice.
- The denominator in the average precision formula is the number of relevant documents because effectively there are exactly that many terms in the sum (rel(r) = 0 for nonrelevant documents).
- The (mean) average precision formula considers the list of retrieved documents only up to the last relevant document. It doesn't care whether there are nonrelevant documents after it. So if the four relevant documents are at ranks 1-4, the AP is always 1 regardless of any nonrelevant documents after rank 4. AnAj 07:26, 13 October 2007 (UTC)
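Picking up the request for Matlab or R code above, here is a minimal sketch in Python instead, implementing average precision as AnAj describes it; the function and variable names are mine, not from the article:
<source lang="python">
def average_precision(rels, num_relevant):
    """Average precision for one query.

    rels: 0/1 relevance judgements of the retrieved documents, in rank order.
    num_relevant: total number of relevant documents for the query.
    """
    hits = 0
    total = 0.0
    for rank, rel in enumerate(rels, start=1):
        if rel:  # precision is accumulated only at relevant ranks (rel(r) = 1)
            hits += 1
            total += hits / rank
    return total / num_relevant if num_relevant else 0.0


def mean_average_precision(queries):
    """Mean of per-query average precision; queries is a list of (rels, num_relevant) pairs."""
    return sum(average_precision(rels, n) for rels, n in queries) / len(queries)


# The example from the thread: first 4 ranks relevant, 5th nonrelevant.
print(average_precision([1, 1, 1, 1, 0], 4))  # 1.0
</source>
For the thread's example this returns 1.0, matching AnAj's reading: nonrelevant documents ranked after the last relevant one do not change the score.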
== New article request ==
Hi there, I'd like to suggest a new article on the full history of information handling/management/technology (details). I'm not knowledgeable enough to do it myself, but contributors here probably are. Thanks, JackyR | Talk 18:04, 4 December 2007 (UTC)
== Information scent ==
Also add an article about information scent, as also noted on information foraging. Jidanni (talk) 18:35, 18 February 2008 (UTC)
== Term Discrimination ==
I am going to start a new page on TermDiscrimination. I am a noob, so can someone give me some pointers on how to get this article linked to? Dspattison (talk) 18:29, 7 January 2008 (UTC)
== Relevance ==
I just overhauled the Relevance (information retrieval) entry after an unsuccessful attempt to get it deleted and merged into this one. Please take a look, help improve it, and link it appropriately to this one.
Dtunkelang (talk) 14:30, 26 May 2008 (UTC)
== External Links ==
Should we move the "5 Open source systems" and "6 Other retrieval tools" sections somewhere else? I assume those are the sections triggering the external links warning.
Dtunkelang (talk) 04:41, 4 June 2008 (UTC)

