|View Issue Details|
|ID||Project||Category||View Status||Date Submitted||Last Update|
|0012319||mantisbt||attachments||public||2010-09-05 15:36||2014-10-12 18:34|
|Target Version||Fixed in Version|
|Summary||0012319: Index attachments' content|
|Description||I'd like to have a plugin for (full-text) indexing attachments' (.doc, .odt, .pdf) content.|
Two options come to mind:
a) use a separate Apache Lucene instance behind some (RESTful HTTP?) interface.
b) use the Apache Tika parser with PostgreSQL's tsearch2 full-text indexer.
Option a) seems like hard work and maybe heavyweight (a Java servlet running on some servlet engine).
Option b) is much easier, at least when you're already using PostgreSQL under your Mantis...
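To make option b) a bit more concrete, here is a minimal Python sketch of the two halves: asking the Tika app for a file's plain text, and handing that text to PostgreSQL's full-text functions. The table `attachment_fts(file_id, body)`, the `'english'` text search configuration, and the function names are assumptions made up for illustration, and the `--text` flag should be checked against the Tika version in use. The `cursor` is any DB-API cursor using `%s` placeholders (psycopg2 style); nothing here is taken from the actual plugin.

```python
import subprocess

# Hypothetical table: attachment_fts(file_id integer, body tsvector)
INDEX_SQL = (
    "INSERT INTO attachment_fts (file_id, body) "
    "VALUES (%s, to_tsvector('english', %s))"
)
SEARCH_SQL = (
    "SELECT file_id FROM attachment_fts "
    "WHERE body @@ plainto_tsquery('english', %s)"
)


def tika_command(path, jar="tika-app.jar"):
    """Build the command line asking the Tika app to print plain text."""
    return ["java", "-jar", jar, "--text", path]


def extract_text(path, jar="tika-app.jar"):
    """Run Tika on one file and return whatever it prints to stdout."""
    result = subprocess.run(
        tika_command(path, jar), capture_output=True, text=True, check=True
    )
    return result.stdout


def index_attachment(cursor, file_id, text):
    """Store the extracted text as a tsvector for one attachment."""
    cursor.execute(INDEX_SQL, (file_id, text))


def search(cursor, words):
    """Return the ids of attachments whose text matches the given words."""
    cursor.execute(SEARCH_SQL, (words,))
    return [row[0] for row in cursor.fetchall()]
```

With a real connection, indexing one file would be roughly `index_attachment(cur, 42, extract_text("spec.pdf"))` followed by a commit.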
|Tags||attachment, feature, fts, plugin, postgresql, wish|
I'd need some suggestions: online or offline indexing of uploaded files?
If offline, then should I call the "java -jar tika-app.jar" directly from PHP, or should that be run from some cron script?
Any other ideas?
This is a big undertaking.
I think you'd ideally want to perform indexing on a cron job cycle at low IO/CPU priority (ionice + renice). Calling an indexing command every time a file is uploaded could leave you with multiple CPU-intensive processes running at once on your server. With a cron job you have much better control over what times of day the CPU-intensive workload runs and how many CPUs are used concurrently.
Of course, this would make it Linux-only, which is a potential downside. That said, it is a plugin, and someone could create a Windows-specific version of it if they wanted to, or contribute patches later to add Windows support to the plugin you're proposing.
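As a rough illustration of the cron approach, the Python sketch below wraps an indexing command so it starts in the idle I/O class (`ionice -c 3`) and at the lowest CPU priority (`nice -n 19`), then formats a crontab entry for it. It uses `nice` at launch time rather than `renice` on a running process, which is a substitution on my part. The helper names, the schedule, and the script path are hypothetical.

```python
import shlex


def low_priority(command, nice_level=19, io_class=3):
    """Prefix a command with ionice and nice so it yields to other work."""
    prefix = ["ionice", "-c", str(io_class), "nice", "-n", str(nice_level)]
    return prefix + list(command)


def crontab_line(schedule, command):
    """Format one crontab entry that runs the command at low priority."""
    quoted = " ".join(shlex.quote(part) for part in low_priority(command))
    return schedule + " " + quoted


if __name__ == "__main__":
    # Hypothetical script path; 02:30 every night is just an example time.
    print(crontab_line("30 2 * * *", ["php", "/path/to/index_attachments.php"]))
```

Running a single cron entry also keeps the number of concurrent indexing processes at one, as long as a run finishes before the next one is due.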
I'm a little concerned about how this will work when we support many different database types. I guess you could just make a full-text search plugin specific to PostgreSQL, etc., but then you'd be limiting the number of users who can use your plugin.
Not to mention the multiple different ways in which attachments can be stored:
1) On a remote FTP server
2) As a file within the uploads/files directory
3) Within the database as big blobs
This is absolutely a WIP, but things work now:
- extract with antiword/unzip/pdftotext OR tika
- indexing backend: PostgreSQL's TSearch2 OR Xapian
- indexing in a cron job (uses file_api's file_get_content, so the storage method doesn't matter).
So indexing works, but the search UI (embedding in the "View Issues" page) is still missing (hopefully next week), and configuration needs more work, too.
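The extraction step in the list above could be sketched as a lookup from file extension to command line. This is a Python illustration only, not the plugin's PHP code: the mapping, the function name, and the rule "use Tika for everything when a jar is given, otherwise the native tool" are assumptions, and the Tika `--text` flag should be checked against the version in use. Note that `unzip -p file.odt content.xml` prints XML, so the tags would still have to be stripped before indexing.

```python
import os

# Native extractors keyed by lowercase extension; each prints to stdout.
NATIVE_EXTRACTORS = {
    ".doc": lambda path: ["antiword", path],
    ".pdf": lambda path: ["pdftotext", path, "-"],
    ".odt": lambda path: ["unzip", "-p", path, "content.xml"],
}


def extractor_command(path, tika_jar=None):
    """Pick the command that turns one attachment into text.

    Returns None when no extractor is known for the file type.
    """
    if tika_jar is not None:
        # Tika handles all supported formats through one entry point.
        return ["java", "-jar", tika_jar, "--text", path]
    extension = os.path.splitext(path)[1].lower()
    builder = NATIVE_EXTRACTORS.get(extension)
    return builder(path) if builder else None
```

For example, `extractor_command("Report.PDF")` gives `["pdftotext", "Report.PDF", "-"]`, while an unknown type such as `notes.txt` gives `None` and can simply be skipped by the indexer.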
|Now search works, but why do I need to set $g_plugin_current = 'AttachmentIndexer' (plugin's name) every time? (not just from the cron job, but from IndexerFilter.class.php, too).|
|Attached a working (at least with TSearch2) version, without tika-app-0.7.jar (17MB).|
Since mantisforge doesn't accept my push efforts, uploaded it to
|2010-09-05 15:36||gthomas||New Issue|
|2010-09-05 15:39||gthomas||Note Added: 0026579|
|2010-09-05 15:40||gthomas||Tag Attached: plugin|
|2010-09-05 15:40||gthomas||Tag Attached: attachment|
|2010-09-05 15:40||gthomas||Tag Attached: feature|
|2010-09-05 15:40||gthomas||Tag Attached: fts|
|2010-09-05 15:40||gthomas||Tag Attached: postgresql|
|2010-09-05 15:40||gthomas||Tag Attached: wish|
|2010-09-19 02:58||dhx||Note Added: 0026782|
|2010-09-19 02:59||dhx||Note Added: 0026783|
|2010-09-19 03:47||gthomas||Note Added: 0026786|
|2010-09-19 15:50||gthomas||Note Added: 0026788|
|2010-09-20 05:28||gthomas||File Added: attachmentindexer-WIP-0.1.2.tbz2|
|2010-09-20 05:28||gthomas||Note Added: 0026796|
|2010-09-25 07:49||gthomas||Note Added: 0026857|
|2014-02-02 11:25||atrol||Severity||tweak => feature|
|2014-10-12 18:34||grangeway||Product Version||git trunk => 1.2.17|