Apache Solr - SolrJ: Extract text using the request handler /update/extract


Notes
Make sure the request handler "/update/extract" is configured in the solrconfig.xml file.
```
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler" />
```
In order for the code below to work:

► Make sure to update the variables ("solrUrl", "collectionName", ...) with your information.

To make committed documents visible to searches, set the property "openSearcher" to true in the solrconfig.xml file (updateHandler -> autoCommit).

Note: You can also force the commit by running the URL: http://localhost:8983/solr/COLLECTION-NAME/update?commit=true
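
As a minimal sketch, the same commit URL can also be called from Java with the JDK's built-in HTTP client (Java 11 or later). The class name "SolrCommit" and the helper "commitUri" are hypothetical names chosen for this illustration, and the Solr URL and collection name are the placeholder values used in this article.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

class SolrCommit {

    // Builds the explicit-commit URL: <solrUrl>/<collectionName>/update?commit=true
    static URI commitUri(String solrUrl, String collectionName) {
        return URI.create(solrUrl + "/" + collectionName + "/update?commit=true");
    }

    public static void main(String[] args) throws InterruptedException {
        // Assumed values: replace them with your own Solr URL and collection name.
        URI uri = commitUri("http://localhost:8983/solr", "collection1");

        HttpRequest request = HttpRequest.newBuilder(uri).GET().build();
        try {
            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode());
            System.out.println(response.body());
        } catch (IOException e) {
            // Solr is not running or not reachable at the assumed URL.
            System.out.println("Could not reach Solr: " + e.getMessage());
        }
    }
}
```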

Example

Extract text using the request handler /update/extract:

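```
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

import org.apache.solr.client.solrj.SolrRequest.METHOD;
import org.apache.solr.client.solrj.impl.CloudSolrClient;
import org.apache.solr.client.solrj.request.ContentStreamUpdateRequest;
import org.apache.solr.common.params.CollectionAdminParams;
import org.apache.solr.common.params.ModifiableSolrParams;
import org.apache.solr.common.util.ContentStreamBase.ByteArrayStream;
import org.apache.solr.common.util.NamedList;

public class ExtractTextExample {

    public static void main(String[] args) throws Exception {
        final String UPDATE_EXTRACT_REQUEST_PATH = "/update/extract";
        final String[] solrUrl = { "http://localhost:8983/solr" };
        final String collectionName = "collection1";

        // try-with-resources closes the client even if the request fails
        try (CloudSolrClient cloudSolrClient = new CloudSolrClient.Builder(Arrays.asList(solrUrl)).build()) {
            cloudSolrClient.setDefaultCollection(collectionName);

            // extracting text [/update/extract] [ContentStreamUpdateRequest::addContentStream]
            final String contentStream = "extract this text ...";
            final String contentType = "text/plain;charset=UTF-8";

            // wrap the text in a content stream (the second argument, the source info, is not used here)
            final ByteArrayStream byteArrayStream = new ByteArrayStream(contentStream.getBytes(StandardCharsets.UTF_8), null);
            byteArrayStream.setContentType(contentType);

            // CollectionAdminParams.COLLECTION = "collection"
            final ModifiableSolrParams modifiableSolrParams = new ModifiableSolrParams();
            modifiableSolrParams.add(CollectionAdminParams.COLLECTION, collectionName);

            // send the stream to the extracting request handler with a POST request
            final ContentStreamUpdateRequest contentStreamUpdateRequest = new ContentStreamUpdateRequest(UPDATE_EXTRACT_REQUEST_PATH);
            contentStreamUpdateRequest.addContentStream(byteArrayStream);
            contentStreamUpdateRequest.setParams(modifiableSolrParams);
            contentStreamUpdateRequest.setMethod(METHOD.POST);

            final NamedList<Object> response = cloudSolrClient.request(contentStreamUpdateRequest);

            System.out.println(response);
        }
    }
}
```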

This should create the following document:

```
<doc>
    <str name="id">222014cd-c96d-454c-a7f4-9fe53ea0b0bb</str>
    <long name="_version_">1621391017444900864</long>
    <arr name="stream_size">
        <str>null</str>
    </arr>
    <arr name="X-Parsed-By">
        <str>org.apache.tika.parser.DefaultParser</str>
        <str>org.apache.tika.parser.txt.TXTParser</str>
    </arr>
    <arr name="stream_content_type">
        <str>text/plain;charset=UTF-8</str>
    </arr>
    <arr name="Content-Encoding">
        <str>UTF-8</str>
    </arr>
    <arr name="Content-Type">
        <str>text/plain; charset=UTF-8</str>
    </arr>
    <arr name="content">
        <str>
            stream_size null
            X-Parsed-By org.apache.tika.parser.DefaultParser
            X-Parsed-By org.apache.tika.parser.txt.TXTParser
            stream_content_type text/plain;charset=UTF-8
            Content-Encoding UTF-8
            Content-Type text/plain; charset=UTF-8
            extract this text ...
        </str>
    </arr>
</doc>
```

Notes
- If your Solr schema has a required unique key, you need to generate a value for that field automatically (see the example below).
- You can configure the request handler to capture Tika attributes and save them in specific fields.
  
  To save Tika attributes in a separate field "meta", add the following option to the request handler:
```
<str name="captureAttr">true</str>
```
  In this case, the "content" field will hold only the extracted text.
  
  To lowercase the names of the extracted fields/attributes, add the following option to the request handler:
```
<str name="lowernames">true</str>
```
  To map an extracted field to a different field name, add an option with the prefix "fmap." to the request handler (here the field "meta" is mapped to the field "attr_"):
```
<str name="fmap.meta">attr_</str>
```
To apply the notes mentioned above, adjust the "solrconfig.xml" file with the following:
```
<requestHandler name="/update/extract" startup="lazy" class="solr.extraction.ExtractingRequestHandler">
    <lst name="defaults">
        <str name="lowernames">true</str>
        <str name="captureAttr">true</str>

        <str name="fmap.meta">attr_</str>
        <str name="fmap.content">_text_</str>

        <str name="update.chain">genuuid</str>
    </lst>
</requestHandler>
```
```
<updateRequestProcessorChain name="genuuid">
    <processor class="solr.UUIDUpdateProcessorFactory">
        <str name="fieldName">id</str>
    </processor>

    <processor class="solr.LogUpdateProcessorFactory" />
    <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
```
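
As an alternative to generating the id on the server, the client can supply it. The sketch below assumes the "literal.<fieldname>" parameter of the extracting request handler, which sets a field to a fixed value, and builds an extract URL with a random UUID as "literal.id". The class name "ExtractUrlWithId" and the helper "extractUrl" are hypothetical, and the URL and collection name are the placeholder values used in this article.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.UUID;

class ExtractUrlWithId {

    // Builds an /update/extract URL that sets the unique key through "literal.id".
    static String extractUrl(String solrUrl, String collectionName, String id) {
        return solrUrl + "/" + collectionName + "/update/extract"
                + "?literal.id=" + URLEncoder.encode(id, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Generate the id on the client instead of relying on the "genuuid" chain.
        String id = UUID.randomUUID().toString();

        // Assumed values: replace them with your own Solr URL and collection name.
        System.out.println(extractUrl("http://localhost:8983/solr", "collection1", id));
    }
}
```

With SolrJ, the same idea would presumably amount to adding "literal.id" to the request parameters before sending the ContentStreamUpdateRequest.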
You also need to adjust the Solr schema and add a catch-all dynamic field (*) to be able to index:
► Tika fields (x_parsed_by, ...)
► and Solr fields (stream_name, stream_source_info, stream_size, stream_content_type)
```
<dynamicField name="*" type="text_general" indexed="true" stored="true" multiValued="true" />
```
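
The catch-all dynamic field is useful because the Tika field names are not known in advance. With "lowernames" enabled, a name such as "X-Parsed-By" is indexed as "x_parsed_by". The sketch below only approximates that mapping (lowercase letters and digits, every other character replaced by an underscore) and is not Solr's own code; the class name "LowerNames" and the helper "lowerName" are hypothetical.

```java
class LowerNames {

    // Approximation of "lowernames": letters and digits are lowercased,
    // every other character is replaced by an underscore.
    static String lowerName(String fieldName) {
        StringBuilder sb = new StringBuilder(fieldName.length());
        for (int i = 0; i < fieldName.length(); i++) {
            char c = fieldName.charAt(i);
            sb.append(Character.isLetterOrDigit(c) ? Character.toLowerCase(c) : '_');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Field names taken from the example document in this article.
        String[] names = { "X-Parsed-By", "Content-Type", "stream_size" };
        for (String name : names) {
            System.out.println(name + " -> " + lowerName(name));
        }
    }
}
```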
To apply these changes, you need to reload the collections that use the updated configuration (Solr schema and config).
If your changes are not applied properly, try restarting Solr. If that does not help, check the Solr logs for errors in your configuration.
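
A collection can be reloaded through the Collections API with the action "RELOAD". The sketch below only builds the URL, which can then be opened in a browser or sent with any HTTP client. It assumes a SolrCloud setup, as implied by the use of CloudSolrClient in this article; the class name "ReloadCollectionUrl" and the helper "reloadUri" are hypothetical, and the URL and collection name are placeholders.

```java
import java.net.URI;

class ReloadCollectionUrl {

    // Builds the Collections API URL that reloads one collection.
    static URI reloadUri(String solrUrl, String collectionName) {
        return URI.create(solrUrl + "/admin/collections?action=RELOAD&name=" + collectionName);
    }

    public static void main(String[] args) {
        // Assumed values: replace them with your own Solr URL and collection name.
        System.out.println(reloadUri("http://localhost:8983/solr", "collection1"));
    }
}
```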