View Issue Details
ID: 0012319
Project: mantisbt
Category: attachments
View Status: public
Date Submitted: 2010-09-05 15:36
Last Update: 2014-10-12 18:34
Assigned To:
Platform:
OS:
OS Version:
Product Version: 1.2.17
Target Version:
Fixed in Version:
Summary: 0012319: Index attachments' content
Description: I'd like to have a plugin for full-text indexing of the content of attachments (.doc, .odt, .pdf).
Additional Information: Possibilities:
  a) use a separate Apache Lucene instance with some (RESTful HTTP?) interface.
  b) use Apache Tika parser with PostgreSQL tsearch2 full-text indexer.

a) seems like hard work and maybe heavyweight (a Java servlet running on some servlet engine)
b) is way easier - at least when you're already using PostgreSQL under your Mantis...
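
A minimal sketch of what option b) could look like on the database side, written in Python only to keep the statements testable. The table name attachment_fts, its columns, and the 'simple' text search configuration are assumptions made for illustration; they are not taken from the plugin. The functions to_tsvector and plainto_tsquery and the @@ operator are PostgreSQL's standard full-text search primitives (built in since 8.3, supplied by the tsearch2 contrib module before that).

```python
# Hypothetical schema: one row per attachment holding its indexed text.
CREATE_SQL = """
CREATE TABLE attachment_fts (
    file_id integer PRIMARY KEY,
    body    tsvector NOT NULL
);
CREATE INDEX attachment_fts_body_idx ON attachment_fts USING gin (body);
"""


def index_statement(file_id, text):
    """Return (sql, params) that stores the extracted text of one attachment."""
    sql = ("INSERT INTO attachment_fts (file_id, body) "
           "VALUES (%s, to_tsvector('simple', %s))")
    return sql, (file_id, text)


def search_statement(query):
    """Return (sql, params) that finds attachments matching a user query."""
    sql = ("SELECT file_id FROM attachment_fts "
           "WHERE body @@ plainto_tsquery('simple', %s)")
    return sql, (query,)
```

The %s placeholders follow the style used by common Python PostgreSQL drivers; the extracted text would come from Tika or another extractor.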
Tags: attachment, feature, fts, plugin, postgresql, wish
Attached Files: attachmentindexer-WIP-0.1.2.tbz2 (7,299 bytes) 2010-09-20 05:28

- Relationships

-  Notes
(0026579)
gthomas (reporter)
2010-09-05 15:39

I'd need some suggestions: online or offline indexing of uploaded files?
If offline, then should I call the "java -jar tika-app.jar" directly from PHP, or should that be run from some cron script?

Any other ideas?
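
For reference, a rough sketch of the "call tika-app.jar directly" variant, in Python rather than PHP. It assumes java is on the PATH, that the jar sits at the path given (the default name tika-app.jar is a placeholder), and that the Tika CLI's --text option is used to print plain text to stdout. The function names are hypothetical.

```python
import subprocess


def tika_command(path, jar="tika-app.jar"):
    """Build the argument list for extracting plain text with the Tika CLI."""
    # --text asks tika-app to write the plain-text content to stdout.
    return ["java", "-jar", jar, "--text", path]


def extract_text(path, jar="tika-app.jar", timeout=120):
    """Run Tika on one file and return its text (raises on failure or timeout)."""
    result = subprocess.run(
        tika_command(path, jar),
        capture_output=True,
        check=True,
        timeout=timeout,
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The same function works from a web request or from a cron script; starting a JVM per file is slow, which is one argument for the offline route.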
(0026782)
dhx (developer)
2010-09-19 02:58

This is a big undertaking.

I think you'd ideally want to perform indexing on a cron job cycle at low IO/CPU priority (ionice + renice). By calling an indexing command every time a file is uploaded, you could end up with multiple CPU-intensive processes running at the same time on your server. With a cron job you have much better control over what times of the day the intensive CPU workload is performed and how many CPUs are used concurrently.
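
As a sketch of that idea: the helper below wraps an indexing command so it starts in the idle I/O class (ionice -c 3) and at the lowest CPU priority (nice -n 19, the launch-time counterpart of renice), then formats it as a crontab line. The script name index_attachments.php and the 02:15 schedule are made up for illustration.

```python
import shlex


def low_priority(argv, nice_level=19, io_class=3):
    """Prefix a command so it runs with idle I/O class and low CPU priority."""
    return ["ionice", "-c", str(io_class), "nice", "-n", str(nice_level), *argv]


def crontab_line(argv, schedule="15 2 * * *"):
    """Format a crontab entry that runs the command at low priority."""
    return schedule + " " + shlex.join(low_priority(argv))


# Hypothetical indexing script, run nightly at 02:15.
print(crontab_line(["php", "index_attachments.php"]))
```

Because cron starts one run per schedule slot, concurrency is bounded by the schedule rather than by upload traffic.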

Of course, this would make it Linux-only, which is a potential downside. That said, it is a plugin, and someone could create a Windows-specific version of it if they wanted to. Or they could contribute patches later to add Windows support to the plugin you're proposing.

I'm a little concerned about how this will work given that we support many different database types. I guess you could just make a full-text search plugin specific to PostgreSQL, etc., but then you'd be limiting the number of users who can use your plugin.
(0026783)
dhx (developer)
2010-09-19 02:59

Not to mention the multiple different ways in which attachments can be stored:

1) On a remote FTP server

2) As a file within the uploads/files directory

3) Within the database as big blobs
(0026786)
gthomas (reporter)
2010-09-19 03:47

This is absolutely a WIP, but things work now:
  - extract with antiword/unzip/pdftotext OR tika
  - indexing backend: PostgreSQL's TSearch2 OR Xapian
  - indexing in a cronjob (uses file_api's file_get_content, so the storage method doesn't matter).

So indexing works, but usage (embed in "View Issues" page) is missing (hopefully next week), and configuration needs more work, too.
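
To illustrate the extraction step listed above, here is a hedged Python sketch of picking an extractor per file type. The mapping of extensions to tools is an assumption about how such a dispatcher might look, not the plugin's actual logic: antiword for .doc, unzip -p to pull content.xml out of an .odt, pdftotext writing to stdout for .pdf, and Tika as the fallback.

```python
import os

# Assumed mapping from extension to a command builder (illustrative only).
EXTRACTORS = {
    ".doc": lambda path: ["antiword", path],
    ".odt": lambda path: ["unzip", "-p", path, "content.xml"],
    ".pdf": lambda path: ["pdftotext", path, "-"],
}


def extractor_command(path, tika_jar="tika-app.jar"):
    """Return the command that extracts text from the given attachment."""
    ext = os.path.splitext(path)[1].lower()
    build = EXTRACTORS.get(ext)
    if build is None:
        # Unknown type: fall back to the Tika CLI (jar path is a placeholder).
        return ["java", "-jar", tika_jar, "--text", path]
    return build(path)
```

Note that the .odt route yields raw XML, so the tags would still need stripping before the text is handed to the indexer.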

(0026788)
gthomas (reporter)
2010-09-19 15:50

Now search works, but why do I need to set $g_plugin_current[0] = 'AttachmentIndexer' (plugin's name) every time? (not just from the cron job, but from IndexerFilter.class.php, too).
(0026796)
gthomas (reporter)
2010-09-20 05:28

Attached a working (at least with TSearch2) version, without tika-app-0.7.jar (17MB).
(0026857)
gthomas (reporter)
2010-09-25 07:49

Since mantisforge doesn't accept my push efforts, uploaded it to [^]

- Issue History
Date Modified Username Field Change
2010-09-05 15:36 gthomas New Issue
2010-09-05 15:39 gthomas Note Added: 0026579
2010-09-05 15:40 gthomas Tag Attached: plugin
2010-09-05 15:40 gthomas Tag Attached: attachment
2010-09-05 15:40 gthomas Tag Attached: feature
2010-09-05 15:40 gthomas Tag Attached: fts
2010-09-05 15:40 gthomas Tag Attached: postgresql
2010-09-05 15:40 gthomas Tag Attached: wish
2010-09-19 02:58 dhx Note Added: 0026782
2010-09-19 02:59 dhx Note Added: 0026783
2010-09-19 03:47 gthomas Note Added: 0026786
2010-09-19 15:50 gthomas Note Added: 0026788
2010-09-20 05:28 gthomas File Added: attachmentindexer-WIP-0.1.2.tbz2
2010-09-20 05:28 gthomas Note Added: 0026796
2010-09-25 07:49 gthomas Note Added: 0026857
2014-02-02 11:25 atrol Severity tweak => feature
2014-10-12 18:34 grangeway Product Version git trunk => 1.2.17
