
Scraping HTML

Tim L edited this page Jul 17, 2015 · 46 revisions

::sigh::

What is first

First, a nice article about just using the web as an API.

Others' work:

What we will cover

This page lists some XSL utility functions that we've developed to scrape HTML:

Let's get to it

The following functions help scrape HTML elements into useful strings. They use the following namespace.

xmlns:html="http://www.w3.org/1999/xhtml"

We prefer to just produce a CSV from the HTML, instead of trying to model it in RDF directly. There are much nicer mechanisms in csv2rdf4lod to handle URI creation within the SDV paradigm. We write a row of CSV using the following.

   <xsl:value-of select="concat($DQ,string-join((
                                                 $perigee,$apogee,$inclination,$period,$semi-major-axis
                                                ),
                                                concat($DQ,',',$DQ)),$DQ,$NL)"/>
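The snippet above relies on `$DQ` and `$NL`, which this page never defines. As a minimal sketch, assuming they hold a double quote and a newline, a complete stylesheet that writes one quoted CSV row per table row could look like the following. The variable definitions, the `text()` suppression, and the doubling of embedded quotes are our assumptions for illustration, not something taken from an actual csv2rdf4lod conversion.

```xml
<xsl:stylesheet version="2.0"
                xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
                xmlns:html="http://www.w3.org/1999/xhtml"
                exclude-result-prefixes="html">

   <xsl:output method="text"/>

   <!-- Assumed definitions: the page uses $DQ and $NL without showing them. -->
   <xsl:variable name="DQ" select="'&quot;'"/>
   <xsl:variable name="NL" select="'&#10;'"/>

   <!-- Suppress text that is not inside a table row. -->
   <xsl:template match="text()"/>

   <!-- One CSV row per HTML table row. Every cell is quoted, and any
        double quote inside a cell is doubled so the row stays parseable. -->
   <xsl:template match="html:tr">
      <xsl:value-of select="concat($DQ,
                                   string-join(for $cell in html:td
                                               return replace(normalize-space($cell),
                                                              $DQ,concat($DQ,$DQ)),
                                               concat($DQ,',',$DQ)),
                                   $DQ,$NL)"/>
   </xsl:template>
</xsl:stylesheet>
```

This assumes the input has already been tidied into well-formed XHTML, so that its elements are in the `html` namespace. A row with no `td` children (a header row of `th` cells, say) would come out as an empty quoted field, so a real conversion would probably want a narrower match pattern.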

Example inputs

DARPA

http://www.darpa.mil/OpenCatalog/index.html circa Feb 2014

<tr>
  <td>Aptima Inc.</td>
  <td>
     <a href='http://www.darpa.mil/External_Link.aspx?url=https://github.com/Aptima/pattern-matching'>Network
Query by Example</a>
  </td>
  <td>Analytics</td>
  <td>2014-07</td>
  <td>https://github.com/Aptima/pattern-matching.git</td>
  <td>
     <a href='stats/pattern-matching/index.html'>stats</a>
  </td>
  <td>Hadoop MapReduce-over-Hive based implementation of network
query by example utilizing attributed network pattern
matching.</td>
  <td>ALv2</td>
</tr>

Visual Analytics Benchmark Repository

http://hcil2.cs.umd.edu/newvarepository/benchmarks.php

html:text

Definition:

<!-- https://github.com/timrdf/csv2rdf4lod-automation/wiki/Scraping-HTML#htmltext -->
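<xsl:function name="html:text">
   <xsl:param name="node"/>
   <!-- Concatenate the whitespace-normalized text nodes beneath $node.
        No separator is added, so text from adjacent inline elements runs together. -->
   <xsl:variable name="together">
      <xsl:for-each select="$node//text()">
         <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>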

Usage:

<xsl:template match="html:tr">
   <xsl:value-of select="concat(html:text(html:td[1]),$NL)"/>
</xsl:template>

Adding a parameter for a delimiter:

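<xsl:function name="html:text">
   <xsl:param name="node"/>
   <xsl:param name="delim"/>
   <!-- Like html:text($node), but $delim is appended after every text node,
        including whitespace-only ones and the last one. A whitespace $delim is
        trimmed by the final normalize-space; any other $delim is left in place. -->
   <xsl:variable name="together">
      <xsl:for-each select="$node//text()">
         <xsl:value-of select="concat(normalize-space(.),$delim)"/>
      </xsl:for-each>
   </xsl:variable>
   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>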

Usage:

<xsl:template match="html:tr">
   <xsl:value-of select="concat(html:text(html:td[1],' '),$NL)"/>
</xsl:template>
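The two-argument `html:text` works well with a space, but with a visible delimiter such as `'|'` it emits the delimiter after every text node, including the whitespace-only ones around child elements. A possible alternative, sketched here under the hypothetical name `html:text-joined` (not one of the original utilities), drops the empty pieces first and lets `string-join` place the delimiter only between the survivors.

```xml
<!-- Hypothetical variant for illustration; not one of the original utilities. -->
<xsl:function name="html:text-joined">
   <xsl:param name="node"/>
   <xsl:param name="delim"/>
   <!-- Normalize each text node, discard those that were only whitespace,
        then join what remains with $delim strictly between items. -->
   <xsl:value-of select="string-join(for $t in $node//text()
                                     return normalize-space($t)[. ne ''],
                                     $delim)"/>
</xsl:function>

<xsl:template match="html:tr">
   <xsl:value-of select="concat(html:text-joined(html:td[2],'|'),$NL)"/>
</xsl:template>
```

For the second cell of the DARPA row above, and assuming the parser keeps the whitespace-only text nodes around the anchor, `html:text(html:td[2],'|')` would give `|Network Query by Example||`, while the variant would give `Network Query by Example`.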

Uses:

html:anchor-labels

Definition:

<!-- https://github.com/timrdf/csv2rdf4lod-automation/wiki/Scraping-HTML#htmlanchor-labels -->
<xsl:function name="html:anchor-labels">
   <xsl:param name="anchors"/>

   <xsl:variable name="together">
      <xsl:for-each select="$anchors">
         <xsl:if test="position() gt 1">
            <xsl:value-of select="'||'"/>
         </xsl:if>
         <xsl:value-of select="normalize-space(.)"/>
      </xsl:for-each>
   </xsl:variable>

   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
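Usage, as a sketch: the function takes the anchor elements themselves, not the cell that contains them, so the caller selects `html:a` first. The choice of the second cell follows the DARPA example row above, and `$NL` is assumed to be a newline.

```xml
<xsl:template match="html:tr">
   <!-- Labels of every anchor in the second cell, separated by '||'. -->
   <xsl:value-of select="concat(html:anchor-labels(html:td[2]/html:a),$NL)"/>
</xsl:template>
```

For the DARPA row this would write `Network Query by Example`. The line break inside the anchor text is collapsed by `normalize-space`.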

Uses:

html:anchor-hrefs

Definition:

<!-- https://github.com/timrdf/csv2rdf4lod-automation/wiki/Scraping-HTML#htmlanchor-hrefs -->
<xsl:function name="html:anchor-hrefs">
   <xsl:param name="anchors"/>
   <xsl:param name="base"/>

   <xsl:variable name="together">
      <xsl:for-each select="$anchors">
         <xsl:if test="position() gt 1">
            <xsl:value-of select="'||'"/>
         </xsl:if>
         <xsl:value-of select="concat($base,normalize-space(@href))"/>
      </xsl:for-each>
   </xsl:variable>

   <xsl:value-of select="normalize-space($together)"/>
</xsl:function>
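Because `html:anchor-hrefs` simply prepends `$base` to each `@href`, it suits pages whose links are all relative. When a page mixes relative and absolute links, as the DARPA row above does, one option is the standard XPath 2.0 `resolve-uri` function. The following is a sketch under the hypothetical name `html:anchor-hrefs-resolved` (not one of the original utilities), and the base URL is the DARPA page given above.

```xml
<!-- Hypothetical variant for illustration; not one of the original utilities. -->
<xsl:function name="html:anchor-hrefs-resolved">
   <xsl:param name="anchors"/>
   <xsl:param name="base"/>
   <!-- Resolve each @href against $base. Relative links are made absolute.
        Links that are already absolute should pass through unchanged. -->
   <xsl:value-of select="string-join(for $a in $anchors[@href]
                                     return string(resolve-uri(normalize-space($a/@href),$base)),
                                     '||')"/>
</xsl:function>

<xsl:template match="html:tr">
   <xsl:value-of select="concat(html:anchor-hrefs-resolved(
                                   html:td//html:a,
                                   'http://www.darpa.mil/OpenCatalog/index.html'),
                                $NL)"/>
</xsl:template>
```

One caveat: `resolve-uri` raises an error when an `@href` is not a valid URI reference (one containing raw spaces, for example), so messy pages may still need cleanup first or the plain concatenation above.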

Uses:

html:parse-value

Uses:

  • n2yo-com/satellites/src/html2csv.xsl

html:capitalize

Definition:

<xsl:function name="html:capitalize">
   <xsl:param name="string"/>
   <xsl:value-of select="concat(upper-case(substring($string,1,1)),
                                substring($string,2))"/>
</xsl:function>
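Usage, as a sketch: `html:capitalize` upper-cases only the first character and leaves the rest of the string untouched, so `html:capitalize('analytics')` yields `Analytics`. It composes with `html:text` as shown below. The choice of the third cell is just for illustration, and `$NL` is assumed to be a newline.

```xml
<xsl:template match="html:tr">
   <!-- Flatten the third cell to a string, then capitalize its first letter. -->
   <xsl:value-of select="concat(html:capitalize(html:text(html:td[3])),$NL)"/>
</xsl:template>
```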