Indexing Rich Documents in Solr
This week I integrated Apache Tika into Moodle to support indexing of rich documents like .pdf, .doc, .ppt, etc. Solr's ExtractingRequestHandler uses Tika to let users upload binary files to Solr; Solr extracts the text from each file and indexes it, making the documents searchable.
One has to send the file to Solr via HTTP POST. The following cURL request does the work:
curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfile=@ps.pdf"
- ps.pdf: the file sent to Solr for content extraction.
- literal.id=1: assigns id=1 to the Solr document thus created.
- commit=true: commits the changes to the Solr index.
- myfile=@ps.pdf: the path needs to be a valid relative or absolute path.
Refer to the Solr wiki for more options on ExtractingRequestHandler.
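The same extract request can also be issued from code instead of the command line. Here is a minimal sketch in Python (the Solr URL, id, and file name mirror the cURL example above; note it sends the file as the raw request body with a Content-Type header rather than as multipart form data, which the extract handler also accepts):

```python
import urllib.request

SOLR_BASE = "http://localhost:8983/solr"  # assumed local Solr instance


def build_extract_url(doc_id, commit=True):
    """Build the ExtractingRequestHandler URL for a given literal id."""
    commit_flag = "true" if commit else "false"
    return f"{SOLR_BASE}/update/extract?literal.id={doc_id}&commit={commit_flag}"


def index_file(path, doc_id, content_type="application/pdf"):
    """POST a binary file to Solr so Tika can extract and index its text."""
    with open(path, "rb") as f:
        data = f.read()
    req = urllib.request.Request(
        build_extract_url(doc_id),
        data=data,
        headers={"Content-Type": content_type},
        method="POST",
    )
    return urllib.request.urlopen(req)


# index_file("ps.pdf", 1)  # equivalent to the cURL request above
```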
Now, with the PECL PHP Solr client used in Moodle, there is no way to retrieve the extracted content and add it to an existing Solr document's field. The cURL request instead creates a brand-new Solr document for each file and adds the extracted content to that document's fields.
Also, Moodle's get_content_file_location() function, which provides the absolute file path of stored files, is protected.
So, keeping the above points in mind, I had to come up with the following logic for indexing rich documents via ExtractingRequestHandler in Global Search.
Access rights are checked by extracting the $id of the Solr document and passing it to the forum's access-check function. Full code.
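That filtering step can be sketched as follows. This is a hypothetical illustration in Python, not Moodle's actual PHP code; forum_check_access, the user structure, and the result-document layout are all assumptions, except for the 'id' field, which is the one attached at indexing time via literal.id:

```python
def forum_check_access(post_id, user):
    """Hypothetical stand-in for the forum module's access-check function."""
    # A real implementation would consult Moodle's capability/permission system.
    return post_id in user.get("readable_posts", set())


def filter_search_results(solr_docs, user):
    """Keep only the Solr documents the current user is allowed to see.

    Each result document is assumed to carry the 'id' field that was
    assigned at indexing time (literal.id=...).
    """
    allowed = []
    for doc in solr_docs:
        doc_id = doc["id"]  # the id assigned via literal.id
        if forum_check_access(doc_id, user):
            allowed.append(doc)
    return allowed
```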
And, here’s the code that I’ve written for the Forum Module.
The above code sends the external files to Solr, which extracts their content and creates new Solr documents. I'm not committing the documents after each cURL request, as that would take a lot of time. Instead, after all the documents have been added, I execute $client->commit() at the end.
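The batching described above — posting every file with the commit disabled and issuing a single commit at the end — can be sketched like this (illustrative Python; the URLs follow the cURL example, while the helper names and the injected post function are my own):

```python
def build_extract_url(solr_base, doc_id):
    # commit=false: defer the (expensive) commit until all files are posted
    return f"{solr_base}/update/extract?literal.id={doc_id}&commit=false"


def build_commit_url(solr_base):
    # A single explicit commit makes all pending documents searchable at once
    return f"{solr_base}/update?commit=true"


def index_files(solr_base, files, post_fn):
    """POST each (doc_id, path) pair, then commit once at the end.

    post_fn(url, path) performs the actual HTTP POST; it is injected here so
    the batching logic stays testable without a running Solr instance.
    """
    for doc_id, path in files:
        post_fn(build_extract_url(solr_base, doc_id), path)
    post_fn(build_commit_url(solr_base), None)  # one commit for the whole batch
```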