From grdetil@scrc.umanitoba.ca Mon Sep 10 14:00:25 2001
Date: Mon, 10 Sep 2001 15:21:39 -0500 (CDT)
From: Gilles Detillieux
To: sn@ParlaNet.de
Cc: "ht://Dig mailing list"
Subject: Re: [htdig] yet another pdf parser

According to Stefan Nehlsen:
> On Thu, Sep 06, 2001 at 04:06:24PM -0500, Gilles Detillieux wrote:
> > According to Stefan Nehlsen:
> >
> > However, since 3.1.4 was released, the use of external parsers isn't
> > usually recommended, as external converters do a better job.
>
> When I started to play with htdig (a year ago ?) parse_doc.pl was the
> one to use. I stuck with it for a long time and started to fix
> bugs (german umlaute) to make it work in the way it should.

Well, 3.1.4 was released in December 1999, but I've gotten more emphatic
in the past year about recommending external converters over external
parsers. This was because of a surprising number of "bug reports" and
requests for help that directly resulted from limitations in parse_doc.pl.

> We have quite a lot of large pdf-files here and I wanted to put
> links to pseudo-anchors (#page= style) into the excerpt. Changing
> parse_doc.pl doesn't seem to be fun and so I started to rewrite it.
>
> Another problem is that most of our pdf-files don't contain nice
> title information. I use the perl-script to generate it from
> urls that are structured in some parts of our content and to merge
> externally stored titles for another part.

Yes, these are all good additions, but they could go into an external
converter just about as easily as in an external parser. The #page=n
anchors could be output as <a name=...> tags, and titles, however they
are obtained by your script, can go between <title> and </title> tags
in the external converter's HTML output. If anything, the external
converter scripts tend to be simpler because the parsing bits are left out.
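
To illustrate the converter approach described above, here is a minimal
sketch. It is written in Python rather than the Perl of the contributed
scripts, and it is not taken from any ht://Dig script. The function name
pages_to_html and the "page=N" anchor naming are assumptions made for this
example, and extracting the per-page text from the PDF is assumed to have
happened already.

```python
import html


def pages_to_html(title, pages):
    """Build an HTML document from a title and a list of per-page text strings.

    Hypothetical helper: the converter only emits HTML and leaves all word
    parsing to the indexer's own HTML parser.
    """
    parts = [
        "<html><head>",
        "<title>%s</title>" % html.escape(title),
        "</head><body>",
    ]
    for number, text in enumerate(pages, start=1):
        # Assumption: one named anchor per page, named "page=N", so that a
        # link to "#page=N" points at the start of that page's text.
        parts.append('<a name="page=%d"></a>' % number)
        parts.append("<p>%s</p>" % html.escape(text))
    parts.append("</body></html>")
    return "\n".join(parts)


if __name__ == "__main__":
    # Example call with made-up title and page text.
    print(pages_to_html("Example title", ["Text of page one", "Text of page two"]))
```

Since the script only writes HTML, attributes such as minimum_word_length
stay under the control of the config file instead of being hardcoded in
the script.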

> It is still htdig 3.1.5 and still alpha but you may have a look at it:
>
> http://www.parlanet.de/_index_parlanet.html
>
> ( Stay on the german version (others don't really exist) and search
> for "castor". )
>
> The design is not really good because it is using frames and so I had
> to use a php-wrapper and made a small change to htsearch. (patch is
> attached -- please ignore it :-)

Actually, I think it's a good patch! We've had requests before for some
way of putting target attributes in the anchor links, and your approach
seems like a clean one. I'll post it, and consider it for the next release.

> Biggest problem now is performance - I've got to get bigger hardware.
>
> Will version 3.2.x be faster?

Well, as they say, your mileage may vary. Technically, there are a few
improvements in the new database structuring that should speed things up.
However, the word database also tends to be larger, and so there seems to
be a slowdown of some searches. As a whole, things should be somewhat
faster as long as you don't rely on the new and as-yet unoptimized phrase
searching feature. With 3.2 as with 3.1, though, you need good hardware
to support a large search index.

> > See
> >
> > ftp://ftp.htdig.org/pub/htdig/contrib/parsers/doc2html.tar.gz
> >
> > for the latest and fanciest incarnation of these.
>
> I found out that it was doing too many things I didn't need. Why should I
> use the same script for different types when htdig is able to
> choose the right one?

It depends on what you're indexing. Sure, for PDFs, they are usually
tagged unambiguously by the server, so htdig can pick the right
converter/parser. The trick is the .doc files, which may be WP, Word,
RTF, or something else, so having one script that looks at both the
"magic number" at the start of the document and the server's returned
Content-Type header can be a real benefit.

> I was thinking about embedding a perl interpreter into htdig
> when I was reading it.
> :-)
>
> ( Maybe I should put some comments into my program. )
>
> > The big problem
> > with external parsers is they don't parse words consistently in the
> > manner that the internal parsers do, and they don't respond to changes
> > in the config file. E.g., if you drop minimum_word_length from 3 to
> > 2, you still won't get 2-letter words from the external parser because
> > of the hardcoded 3 in there. It also won't look at valid_punctuation,
> > extra_word_characters, or any other attribute that controls parsing.
>
> ok -- this is true -- but this was not really my problem.
>
> htdig is really great, it is working quite well and every time I look at
> it I find new features to try out.

Glad to hear it. Here's a repost of your patch, for the benefit of the
mailing list. Apply in 3.1.5's main source directory using
"patch -p0 < this-message".

--- htcommon/defaults.cc.org  Wed Aug 29 11:05:37 2001
+++ htcommon/defaults.cc  Wed Aug 29 11:07:12 2001
@@ -31,6 +31,7 @@
     {"allow_in_form", ""},
     {"allow_numbers", "false"},
     {"allow_virtual_hosts", "true"},
+    {"anchor_target", ""},
     {"authorization", ""},
     {"backlink_factor", "1000"},
     {"bad_extensions", ".wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi"},
--- htsearch/Display.cc-ORG  Tue Aug 21 14:40:57 2001
+++ htsearch/Display.cc  Wed Aug 29 11:29:45 2001
@@ -1199,6 +1199,7 @@
 {
     static char *start_highlight = config["start_highlight"];
     static char *end_highlight = config["end_highlight"];
+    static char *anchor_target = config["anchor_target"];
     static String result;
     int pos;
     int which, length;
@@ -1211,8 +1212,12 @@
     result.append(str, pos);
     ww = (WeightWord *) (*searchWords)[which];
     result << start_highlight;
-    if (first && fanchor)
-        result << "";
+    if (first && fanchor) {
+        result << "";
+    }
     result.append(str + pos, length);
     if (first && fanchor)
         result << "";

--
Gilles R. Detillieux              E-mail:
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list
To unsubscribe, send a message to with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html