Date: Fri, 15 Nov 2002 11:50:43 +0300
From: Alexander I. Lebedev
To: htdig-general@lists.sourceforge.net
Subject: [htdig] changes to Display.cc: increase of relevance

Hi!

I find the results that HTDig produces with a non-zero backlink_factor
very obscure and even misleading. I would like to propose another
algorithm for calculating the relevance.

Suppose someone is looking for a solution to a problem. First, s/he
needs to know the different approaches to the problem, so Web pages that
provide a _choice_ (i.e. pages with more outgoing links) appear more
important than the others. On the other hand, pages with more incoming
links (links coming from other pages) are more important too. This is why
I propose to calculate the backlink_score of a document according to the
formula:

    backlink_score = 1 + backlink_factor *
                     (number_of_incoming_links + number_of_outgoing_links);
    score = score * backlink_score;

with a backlink_factor of about 0.04 (this doubles the backlink_score
for a page with 25 links). Note that when calculating the score I use
multiplication instead of addition.

Another point: I also propose to correct the algorithm for the date_score
calculation, because the current algorithm even gave me _negative_ values
of date_score for some of my documents (!?). In my opinion, exponential
decay with a characteristic decay time of 3 years is a reasonable function
for describing how a Web page loses its timeliness (it corresponds to a
roughly fivefold decrease of date_score for a 5-year-old document). Note
again that I use multiplication instead of addition when calculating the
score. (I think calculating an exponent takes no more than 1 microsecond
on modern processors, so it is quite fast.)

The proposed algorithms are probably not perfect, but if you test them,
I think you'll find the results more relevant than with the previous ones
(at least from the end user's point of view).
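
The figures quoted above can be checked by hand. In the sketch below, N_in and N_out stand for the incoming and outgoing link counts, t for the document age in years and tau for the decay time; these symbols are shorthand introduced here, not names from the HTDig source. It also assumes that date_factor in the patch plays the role of 1/tau.

```latex
\begin{align*}
\text{backlink\_score}
  &= 1 + \text{backlink\_factor}\,(N_{\text{in}} + N_{\text{out}})
   = 1 + 0.04 \times 25 = 2, \\
\text{date\_score}
  &= e^{-t/\tau} = e^{-5/3} \approx 0.19
  \qquad (\tau = 3,\ t = 5), \\
\text{date\_score}
  &= e^{-\text{date\_factor}\cdot t} = e^{-0.3 \times 5} = e^{-1.5} \approx 0.22
  \qquad (\text{date\_factor} = 0.3).
\end{align*}
```

So a decay time of exactly 3 years gives a decrease of about 5.3 times after 5 years, while the rounded value date_factor = 0.3 used in the patch comment gives about 4.5 times; both are "fivefold" only approximately.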

- Alexander

--------------------------

Here is a simple patch to the Display.cc file (version 3.2.0b4-20021110):

--- Display.cc.orig	Sat Jul 27 03:48:19 2002
+++ Display.cc	Thu Nov 14 21:43:19 2002
@@ -1420,27 +1420,27 @@
       // Other changes to the score can happen now
       // Or be calculated by the result match in getScore()
 
-      // This formula derived through experimentation
-      // We want older docs to have smaller values and the
-      // ultimate values to be a reasonable size (max about 100)
-
       base_score = score;
+      const double year = 31556925.97; // number of seconds in a year
       if (date_factor != 0.0)
         {
-          date_score = date_factor *
-            ((thisRef->DocTime() * 1000.0 / (double)now) - 900);
-          score += date_score;
+// AIL: exponential decay with time: older docs have smaller date_score.
+// date_factor=0.3 results in a fivefold decrease of the date_score for
+// a 5-years-old document.
+          date_score = exp(- date_factor *
+            (double)(now - (thisRef->DocTime())) / year);
+          score *= date_score;
         }
 
       if (backlink_factor != 0.0)
         {
           int links = thisRef->DocLinks();
-          if (links == 0)
-            links = 1; // It's a hack, but it helps...
-
-          backlink_score = backlink_factor
-            * (thisRef->DocBackLinks() / (double)links);
-          score += backlink_score;
+// AIL: new strategy: more links -- more informative page
+// backlink_factor=0.04 results in a twofold increase of the backlink_score
+// for a document with 25 links.
+          backlink_score = (1.0 + backlink_factor *
+            (thisRef->DocBackLinks() + (double)links));
+          score *= backlink_score;
         }
 
       if (debug) {
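
For anyone who wants to try the numbers without rebuilding htsearch, the patched calculation can be pulled out into a small standalone sketch. This is only an illustration under assumptions: DocInfo and the three function names are hypothetical stand-ins invented here, not part of the ht://Dig sources, and the struct simply holds the values the patch reads through thisRef->DocTime(), DocLinks() and DocBackLinks().

```cpp
#include <cassert>
#include <cmath>
#include <ctime>

// Hypothetical stand-in for the values the patch reads from thisRef;
// this is NOT the real ht://Dig document class.
struct DocInfo {
    std::time_t doc_time;   // modification time, as returned by DocTime()
    int outgoing_links;     // as returned by DocLinks()
    int incoming_links;     // as returned by DocBackLinks()
};

// Number of seconds in a year, the same constant the patch uses.
const double kSecondsPerYear = 31556925.97;

// Multiplicative date weight: 1.0 for a page modified "now",
// decaying exponentially with the age of the page in years.
double date_score(const DocInfo& doc, std::time_t now, double date_factor) {
    double age_years = std::difftime(now, doc.doc_time) / kSecondsPerYear;
    return std::exp(-date_factor * age_years);
}

// Multiplicative link weight: 1.0 for a page with no links at all,
// growing linearly with the total number of incoming and outgoing links.
double backlink_score(const DocInfo& doc, double backlink_factor) {
    assert(doc.incoming_links >= 0 && doc.outgoing_links >= 0);
    return 1.0 + backlink_factor * (doc.incoming_links + doc.outgoing_links);
}

// Applies both weights by multiplication, skipping a weight whose
// factor is zero, in the same order as the patched code.
double adjusted_score(double base_score, const DocInfo& doc, std::time_t now,
                      double date_factor, double backlink_factor) {
    double score = base_score;
    if (date_factor != 0.0)
        score *= date_score(doc, now, date_factor);
    if (backlink_factor != 0.0)
        score *= backlink_score(doc, backlink_factor);
    return score;
}
```

Because both weights are multiplied in, a factor of zero leaves the score untouched, and neither weight can become negative. One thing to keep in mind: a document whose timestamp lies in the future gets a date_score above 1.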