API Access to Extracted Text

Post by skusa108 » Thu Dec 11, 2014 5:03 pm

Does anyone have guidance on accessing the extracted text through the Autopsy API? Specifically, I'm writing a report module and would like to include the extracted text for files included in the report, when available.

I'm using the Python API, so for example, I'm grabbing a file by ID:

Code:
item = sleuthkitCase.getAbstractFileById(fileID)


From there, I'm currently using the getCrtimeAsDate(), getAtimeAsDate(), and getMtimeAsDate() functions to retrieve the object's metadata. But I'd also like to get at the extracted text, and I don't see any obvious way to do so. It could be that I'm just completely missing something, though.

Any help is appreciated!

Re: API Access to Extracted Text

Post by carrier » Wed Jan 07, 2015 10:41 pm

The best reference for this is this file:

https://github.com/sleuthkit/autopsy/bl ... arkup.java

Specifically, the getSolrContent() method.
Overall, the code needs to look _something_ like this:

Code:
// Get the keyword search Solr server, then fetch one chunk (page) of extracted text.
Server solrServer = KeywordSearch.getServer();
String content = solrServer.getSolrContent(currentContent, chunkId);


(though in Python and not Java)

This is not well documented, but probably should be. We have a bunch of modules that do this type of thing.

You can get the number of chunks with this method:

solrServer.queryNumFileChunks(contentID);

A chunk is a page. Large files are broken up into smaller pages.
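
Putting the two calls together in Python, the loop over pages might look roughly like the sketch below. It is only a sketch under assumptions: `collect_extracted_text` and `FakeServer` are hypothetical names made up for illustration, the chunk IDs are assumed to start at 1, and a file with no chunks is assumed to be addressed with chunk ID 0. Check those assumptions against the linked source before relying on them. Inside Autopsy, `server` would be whatever KeywordSearch.getServer() returns; the stand-in class is there only so the loop can be run outside Autopsy.

```python
# Sketch only. The two server method names come from the post above;
# everything else (function name, FakeServer, chunk numbering) is assumed.

def collect_extracted_text(server, content, content_id, first_chunk_id=1):
    """Join the extracted text of every chunk (page) of one file."""
    num_chunks = server.queryNumFileChunks(content_id)
    if num_chunks == 0:
        # Assumption: an unchunked file is addressed with chunk ID 0.
        return server.getSolrContent(content, 0) or ""
    pages = []
    for chunk_id in range(first_chunk_id, first_chunk_id + num_chunks):
        text = server.getSolrContent(content, chunk_id)
        if text:  # skip chunks for which nothing was returned
            pages.append(text)
    return "\n".join(pages)


class FakeServer(object):
    """Hypothetical stand-in so the sketch runs outside Autopsy."""

    def __init__(self, pages):
        self._pages = pages  # dict mapping chunk ID -> text

    def queryNumFileChunks(self, content_id):
        # Count only real pages; ID 0 stands for "whole file" here.
        return len([k for k in self._pages if k > 0])

    def getSolrContent(self, content, chunk_id):
        return self._pages.get(chunk_id)


if __name__ == "__main__":
    demo = FakeServer({1: "first page", 2: "second page"})
    print(collect_extracted_text(demo, None, 42))
```

In a report module, the same function could be called once per file with the item returned by getAbstractFileById() and its ID, and the result written next to the timestamps.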