From grdetil@scrc.umanitoba.ca Fri Aug 23 16:25:00 2002
Date: Fri, 23 Aug 2002 15:29:38 -0500 (CDT)
From: Gilles Detillieux
To: Ted Stresen-Reuter
Cc: "ht://Dig mailing list"
Subject: Re: [htdig] pdf-parser

According to Ted Stresen-Reuter:
> On a related note, is there any way to customize the TITLE attribute
> htsearch displays for pdfs? We have over 100 MB of pdfs we index every night
> and it would be VERY helpful to be able to provide more accurate titles in
> the search results.

Well, the best way is to edit the PDF description information, in Acrobat
Exchange, to set the title.  That way, the conv_doc.pl or doc2html.pl
script will pick it up automatically, via pdfinfo.  Failing that, the
other option is to put a hook into your Perl script to read the alternate
title for a given URL from a file.  Here's how I did it in conv_doc.pl,
for some PDFs of scientific papers...

--- contrib/conv_doc.pl.orig	Thu Jul 12 09:38:29 2001
+++ contrib/conv_doc.pl	Thu Oct 18 12:23:58 2001
@@ -71,6 +71,7 @@
 $CATPDF = "/usr/bin/pdftotext";
 $PDFINFO = "/usr/bin/pdfinfo";
 #$CATPDF = "/usr/local/bin/pdftotext";
 #$PDFINFO = "/usr/local/bin/pdfinfo";
+$titlelist = "/home/httpd/html/SCRC/manuscripts/titles.lst";
 #########################################
 #
@@ -183,6 +183,23 @@
 if ($ishtml) {
 print "\n\n";
 # print out the title, if it's set, and not just a file name, or make one up
+if (-r $titlelist) {
+    if (open(INFO, "grep \"$ARGV[2]\" $titlelist 2>$null |")) {
+        while (<INFO>) {
+            if (/^$ARGV[2]/) {
+                s/^$ARGV[2]\s+//;
+                s/\s+$//;
+                s/\s+/ /g;
+                s/&/\&amp\;/g;
+                s/</\&lt\;/g;
+                s/>/\&gt\;/g;
+                $title = $_;
+                last;
+            }
+        }
+        close INFO;
+    }
+}
 if ($title eq "" || $title =~ /^[A-G]:[^\s]+\.[Pp][Dd][Ff]$/) {
     @parts = split(/\//, $ARGV[2]); # get the file basename
     $parts[-1] =~ s/%([A-F0-9][A-F0-9])/pack("C", hex($1))/gie;

Here, for example, is a line from titles.lst:

http://www.scrc.umanitoba.ca/SCRC/manuscripts/41.pdf	Spinal circuitry of sensorimotor control of locomotion

-- 
Gilles R. Detillieux              E-mail: <grdetil@scrc.umanitoba.ca>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
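
For anyone adapting the hook, the same lookup can also be done without piping through grep. What follows is only a minimal sketch, assuming the titles.lst layout shown above (URL, whitespace, then the title on one line). The subroutine name lookup_title is hypothetical and is not part of conv_doc.pl.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of conv_doc.pl: return the HTML-escaped
# title listed for $url in $listfile, or "" if there is no entry or the
# file cannot be opened.
# Assumes each line has the form: URL, whitespace, title.
sub lookup_title {
    my ($listfile, $url) = @_;
    open(my $fh, '<', $listfile) or return "";
    my $title = "";
    while (my $line = <$fh>) {
        # \Q...\E quotes regex metacharacters in the URL (".", "?", "+"),
        # and the required \s+ keeps one URL from matching a longer one.
        next unless $line =~ /^\Q$url\E\s+(.*)$/;
        $title = $1;
        $title =~ s/\s+$//;      # trim trailing whitespace
        $title =~ s/\s+/ /g;     # collapse runs of whitespace
        $title =~ s/&/&amp;/g;   # escape HTML specials, "&" first
        $title =~ s/</&lt;/g;
        $title =~ s/>/&gt;/g;
        last;
    }
    close($fh);
    return $title;
}

# Possible use inside the script, keeping any pdfinfo title when the
# list has no entry for this URL ($titlelist and $title as in the patch):
#   my $alt = lookup_title($titlelist, $ARGV[2]);
#   $title = $alt if $alt ne "";
```

Reading the file directly avoids passing the URL through a shell command line, and quoting it with \Q...\E means a "." or "?" in the URL is matched literally rather than as a regex operator.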