Index HTML content pages in Apache Nutch 2.x ( 2.2.1 )

Mohamed Meabed
Oct 21, 2014

index-html plugin for Apache Nutch 2.x

I have looked through almost all the posts and tutorials out there, and I haven't found a complete module that works or a clear approach, so I built a plugin that really WORKS!

Index HTML content of the pages in Apache Nutch 2.x ( 2.2.1 )

The plugin is hosted on GitHub.

— — — — — — — — — — — — — — — — — —

Instructions:

Compile from Source

Download the plugin folder "index-html" and copy it to your Apache Nutch 2 plugin directory (e.g. apache-nutch-2.2.1/src/plugin).

Add the index-html plugin to the plugin folder's build.xml (apache-nutch-2.2.1/src/plugin/build.xml) in both the deploy and clean targets, so the file looks like this:

    <target name="deploy">
      ...
      <ant dir="index-basic" target="deploy"/>
      <ant dir="index-more" target="deploy"/>
      <ant dir="index-html" target="deploy"/>
      <ant dir="language-identifier" target="deploy"/>
      ...
    </target>
    <target name="clean">
      ...
      <ant dir="index-basic" target="clean"/>
      <ant dir="index-more" target="clean"/>
      <ant dir="index-html" target="clean"/>
      <ant dir="language-identifier" target="clean"/>
      ...
    </target>

Run "ant runtime" in the Apache Nutch 2 root folder to start the build.

You should now have index-html.jar in the build folder.

Enable the plugin by adding it to plugin.includes in nutch-site.xml (or nutch-default.xml), as below:

    <configuration>
      ...
      <property>
        <name>plugin.includes</name>
        <value>...someplugins...|index-html</value>
      </property>
      ...
    </configuration>
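A note on why the "|index-html" suffix works: Nutch treats the plugin.includes value as a regular expression and loads a plugin only when its id matches the whole expression. The snippet below is a minimal sketch of that check using plain java.util.regex. It is not Nutch's own code, and the class name, method name and sample values are made up for illustration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: mimics a full-match check of a
// plugin id against a plugin.includes style regular expression.
class PluginIncludesCheck {

    /** Returns true when pluginId matches the entire plugin.includes regex. */
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Sample values only; use the value from your own nutch-site.xml.
        String withPlugin = "protocol-http|parse-(html|tika)|index-(basic|more)|index-html";
        String withoutPlugin = "protocol-http|parse-(html|tika)|index-(basic|more)";

        System.out.println(isIncluded(withPlugin, "index-html"));    // true
        System.out.println(isIncluded(withoutPlugin, "index-html")); // false
    }
}
```

Because the value is a regex, writing index-(basic|more|html) should enable the plugin just as well as appending |index-html.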

The plugin adds a new field, "rawcontent", to the Nutch document. To index this field you need to add it to your schema (schema.xml or schema-solr4.xml) like this:

    <field name="rawcontent" type="text" stored="true" indexed="true" multiValued="false"/>
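To give a rough idea of what ends up in that field, here is a minimal, self-contained sketch of the general idea: take the raw bytes of a fetched page, decode them to a string, and store the result under "rawcontent". This is not the plugin's actual source. The class and method names are hypothetical, a plain Map stands in for the Nutch document, and UTF-8 is assumed for decoding, which the real plugin may handle differently.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only: a Map stands in for the Nutch document.
class RawContentSketch {

    /** Decodes raw page bytes as UTF-8 (an assumption); null becomes "". */
    static String decode(ByteBuffer raw) {
        if (raw == null) {
            return "";
        }
        // duplicate() leaves the caller's buffer position untouched
        return StandardCharsets.UTF_8.decode(raw.duplicate()).toString();
    }

    /** Adds the decoded HTML to the document under the "rawcontent" key. */
    static Map<String, String> addRawContent(Map<String, String> doc, ByteBuffer raw) {
        doc.put("rawcontent", decode(raw));
        return doc;
    }

    public static void main(String[] args) {
        byte[] html = "<html><body>Hello</body></html>".getBytes(StandardCharsets.UTF_8);
        Map<String, String> doc = addRawContent(new HashMap<>(), ByteBuffer.wrap(html));
        System.out.println(doc.get("rawcontent"));
    }
}
```

Since the field holds the whole HTML of each page, stored="true" will make the index noticeably larger, so keep that in mind for big crawls.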

Run the crawler and you should see the new rawcontent field in the index!

Use Pre-Compiled Library

In the repo there is a build folder containing a compiled .jar library, ready for use. Copy the library to your local runtime plugins path if you are running Nutch locally (apache-nutch-2.2.1/runtime/local/plugins). Then follow the steps above to configure nutch-site.xml.

Originally published at www.meabed.net on September 29, 2014.
