Antriksh edited this page Oct 11, 2015 · 4 revisions

Installation

  1. Before building nutchpy from source, make sure you have the required build prerequisites installed.
  2. Get the source by cloning the repository using the following command.
    git clone https://github.com/ContinuumIO/nutchpy.git
  3. Then run the following commands to run the setup.py script. (Super-user permission is required to run the setup script.)
     cd nutchpy
     sudo python setup.py install
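After the install finishes, a quick sanity check confirms that Python can find the package. This is a minimal sketch of our own, not part of nutchpy itself; it only checks that the module is importable:

```python
import importlib.util

# Look up nutchpy without importing it; find_spec returns None
# when the package is not installed on this interpreter.
spec = importlib.util.find_spec("nutchpy")
if spec is None:
    print("nutchpy is not installed")
else:
    print("nutchpy found at", spec.origin)
```

If the second message is printed, the installation succeeded and the examples below should run.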

Usage

The nutchpy setup comes with two simple, easy-to-understand examples by default. Its basic usage is as follows:

 import nutchpy

 node_path = "<FULL-PATH-TO-CRAWLED-DATA>/data"
 seq_reader = nutchpy.sequence_reader
 print(seq_reader.head(n, node_path)) # Prints the first n rows of the file
 print(seq_reader.slice(start, stop, node_path)) # Prints the rows between start and stop
 data = seq_reader.read(node_path)
 print(data) # Prints the whole file content
  • node_path - The path to the crawled data file. On a default Nutch installation, it typically looks like nutch/runtime/local/crawl/crawldb/current/part-00000/data

To process the entire data set and iterate over the URLs, read the content. The content is returned as a list. The sample below iterates over all the URLs.
     import nutchpy

     path = 'path-to-nutch/nutch/runtime/local/crawl/crawldb/current/part-00000/data'

     data = nutchpy.sequence_reader.read(path)
     for list_item in data:
          print(list_item[0]) # Prints the url
          print(list_item[1]) # Prints details about the url

One row of sample output from the above code looks like this:

https://www.abc.com
Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Sep 26 23:52:36 PDT 2015
Modified time: Wed Dec 31 16:00:00 PST 1969
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.0
Signature: null
Metadata: 
 	_repr_=https://www.abc.com
	_pst_=moved(12), lastModified=0: https://www.abc.com
	Content-Type=text/html
	_rs_=115

Using the above sample program, one can get all the details of the crawled database: the status of each URL (fetched or not), the reason a fetch failed, and the MIME types of the fetched files.
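The per-URL detail string can be picked apart with ordinary string handling. A minimal sketch, assuming the `Status: <code> (<label>)` line format shown in the sample output above (the field layout is taken from that output, not from a documented API):

```python
def parse_status(detail):
    """Extract the numeric status code and its label from a
    crawldb detail string like the sample output above."""
    for line in detail.splitlines():
        line = line.strip()
        if line.startswith("Status:"):
            # e.g. "Status: 1 (db_unfetched)"
            rest = line[len("Status:"):].strip()
            code, _, label = rest.partition(" ")
            return int(code), label.strip("()")
    return None, None

# Sample detail string in the shape of the output above
sample = """Version: 7
Status: 1 (db_unfetched)
Fetch time: Sat Sep 26 23:52:36 PDT 2015"""

code, label = parse_status(sample)
print(code, label)  # 1 db_unfetched
```

The same pattern extends to the other fields (Fetch time, Signature, Content-Type) by matching their prefixes.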

Troubleshooting

The program may sometimes fail with the following error:

py4j.protocol.Py4JJavaError: An error occurred while calling z:com.continuumio.seqreaderapp.SequenceReader.slice.
: java.lang.OutOfMemoryError: Java heap space
	at java.util.Arrays.copyOf(Arrays.java:3181)
	at java.util.ArrayList.grow(ArrayList.java:261)
	at java.util.ArrayList.ensureExplicitCapacity(ArrayList.java:235)
	at java.util.ArrayList.ensureCapacityInternal(ArrayList.java:227)
	at java.util.ArrayList.add(ArrayList.java:458)
	at com.continuumio.seqreaderapp.SequenceReader.slice(SequenceReader.java:143)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:497)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
	at py4j.Gateway.invoke(Gateway.java:259)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:207)
	at java.lang.Thread.run(Thread.java:745)

This happens because the data set is too large for py4j to process in one go. To work around this, use the slice and head methods to read the data in chunks. The sample program below walks through the entire data set using slice, 1000 items at a time.

import nutchpy

# Process one batch of data
def parseData(data):
    # Return False if there is no new data to parse
    if not data:
        return False

    # Return True if there is more data
    return True

i = 0            
path = 'path-to-nutch/nutch/runtime/local/crawl/crawldb/current/part-00000/data'
dataPresent = True
# Parse the data 1000 rows at a time; otherwise the system won't be able to handle such a large data set
while dataPresent:
    data = nutchpy.sequence_reader.slice(i, i + 1000, path)
    dataPresent = parseData(data)
    i = i + 1000
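In the skeleton above, parseData only signals whether more data exists. One way it could be filled in is to collect the URL from each row; this sketch assumes each batch is a list of [url, details] pairs, matching the read example earlier (the urls accumulator is our own name):

```python
urls = []

def parseData(data):
    # Return False if there is no new data to parse
    if not data:
        return False
    # Each row is a [url, details] pair; keep the url
    for url, details in data:
        urls.append(url)
    return True

# Hypothetical batch in the shape returned by sequence_reader.slice
batch = [["https://www.abc.com", "Status: 1 (db_unfetched)"]]
print(parseData(batch))  # True
print(parseData([]))     # False
print(urls)              # ['https://www.abc.com']
```

Dropping this definition into the loop above gathers every URL in the crawl database without loading the whole file at once.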