Merge pull request #39 from ecoron/0.11.0
0.11.0
Ronald Schmidt authored Aug 26, 2018
2 parents 610fd72 + faa250b commit 5f943ec
Showing 23 changed files with 296 additions and 101 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -89,4 +89,5 @@ ENV/
.settings

make.bat
phantomjs/
phantomjs/
chromedriver/
3 changes: 1 addition & 2 deletions .travis.yml
@@ -8,8 +8,7 @@ sudo: required


install:
- sh install_chrome.sh
# - pip install -r requirements.txt
# - sh install_chrome.sh
- python setup.py -q install
# command to run tests
script: pytest
29 changes: 24 additions & 5 deletions README.rst
@@ -29,7 +29,7 @@ Extract these result types
* results - standard search result
* shopping - shopping teaser within regular search results

For each result in a resultspage get
For each result of a resultspage get
====================================

* domain
@@ -44,8 +44,8 @@ For each result in a resultspage get

Also get a screenshot of each result page.
You can also scrape the text content of each result url.
It also possible to save the results as CSV for future analytics.
If required you can use your own proxylist.
It is also possible to save the results as CSV for future analytics.
If required you can also use your own proxylist.
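
A minimal sketch of this workflow (the keyword and output handling are placeholders; the calls follow the `examples`_ referenced below):

.. code-block:: python

    import serpscrap

    keywords = ['example keyword']
    config = serpscrap.Config()
    scrap = serpscrap.SerpScrap()
    scrap.init(config=config.get(), keywords=keywords)
    results = scrap.run()

    for result in results:
        print(result)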


Resources
@@ -106,11 +106,26 @@ To avoid encode/decode issues use this command before you start using SerpScrap
.. image:: https://raw.githubusercontent.com/ecoron/SerpScrap/master/docs/logo.png
:target: https://github.com/ecoron/SerpScrap

Supported OS
------------

* SerpScrap should work on Linux, Windows and Mac OS with Python >= 3.4 installed
* SerpScrap requires lxml
* It doesn't work on iOS

Changes
-------
Notes about major changes between releases

0.11.0
======

* Chrome headless is now the default browser; usage of PhantomJS is deprecated
* chromedriver is installed on the first run (tested on Linux and Windows; Mac OS should also work)
* the behavior of scraping raw text contents from SERP urls, and also from given urls, has changed
* scraping of SERP results and url contents can now run at once
* the CSV output format changed: it is now tab-separated and quoted

0.10.0
======

@@ -132,13 +147,17 @@ Notes about major changes between releases
References
----------

SerpScrap is using `PhantomJs`_ a scriptable headless WebKit, which is installed automaticly on the first run (Linux, Windows).
The scrapcore is based on `GoogleScraper`_ with several improvements.
SerpScrap uses `Chrome headless`_ and `lxml`_ to scrape SERP results. For the raw text contents of fetched urls it uses `beautifulsoup4`_.
SerpScrap also supports `PhantomJs`_ (deprecated), a scriptable headless WebKit, which is installed automatically on the first run (Linux, Windows).
The scrapcore was based on `GoogleScraper`_, an outdated project, and has many changes and improvements.

.. target-notes::

.. _`install`: http://serpscrap.readthedocs.io/en/latest/install.html
.. _`examples`: http://serpscrap.readthedocs.io/en/latest/examples.html
.. _`Chrome headless`: http://chromedriver.chromium.org/
.. _`lxml`: https://lxml.de/
.. _`beautifulsoup4`: https://www.crummy.com/software/BeautifulSoup/
.. _`PhantomJs`: https://github.com/ariya/phantomjs
.. _`GoogleScraper`: https://github.com/NikolaiT/GoogleScraper

4 changes: 2 additions & 2 deletions docs/conf.py
@@ -58,9 +58,9 @@
# built documents.
#
# The short X.Y version.
version = '0.10'
version = '0.11'
# The full version, including alpha/beta/rc tags.
release = '0.10.4'
release = '0.11.0'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
5 changes: 3 additions & 2 deletions docs/configuration.rst
@@ -21,12 +21,13 @@ Default configuration
* database_name: '/tmp/serpscrap' - path and name of the sqlite db (stores scrape results)
* dir_screenshot: '/tmp/screenshots' - basedir for saved screenshots
* do_caching: True - enable / disable caching
* executable_path: '/usr/local/bin/chromedriver' - path to chromedriver
* executable_path: '/usr/local/bin/chromedriver' - path to chromedriver, should be detected automatically
* google_search_url: 'https://www.google.com/search?' - base search url, modify for other countries
* headers: - dict to customize request header, see below
* num_pages_for_keyword: 2 - number of result pages to scrape
* num_results_per_page: 10 - number of results per search engine page
* proxy_file: '' - path to proxy file, see below
* sel_browser: 'chrome' - browser (chrome, phantomjs)
* scrape_urls: False - scrape urls of search results
* screenshot: True - enable screenshots for each query
* search_engines: ['google'] - search engines (google)
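
As a sketch, any of these defaults can be overridden with config.set() before scraping (the values below are illustrative, not recommendations):

.. code-block:: python

    import serpscrap

    config = serpscrap.Config()
    config.set('num_pages_for_keyword', 1)  # scrape only the first result page
    config.set('screenshot', False)         # skip screenshots
    config.set('do_caching', False)         # disable caching for this run
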
@@ -80,7 +81,7 @@ for config keys that are not provided the default values still exist
Headers
-------

You can customize your searchengine request headers
You can customize your search engine request headers if you are using PhantomJS
by providing a dict in your configuration. If you
don't customize this setting, the default is used.

95 changes: 59 additions & 36 deletions docs/examples.rst
@@ -34,62 +34,60 @@ You can disable url scraping by setting the config value scrape_urls to False.
for result in results:
    print(result)
Simple Example - custom phantomjs path
--------------------------------------
Simple example using phantomjs (deprecated)
-------------------------------------------

If phantomjs could not be installed, configure your
custom path to the binary.
.. code-block:: bash
.. code-block:: python
python examples\example_phantomjs.py
It is possible to use PhantomJS, but we recommend Chrome. Depending on your choice, an automatic install of the required binary will be attempted.
For using Chrome you need the latest `chromedriver`_ and have to set the executable_path.

.. code-block:: bash
import pprint
import serpscrap
keywords = ['seo trends', 'seo news', 'seo tools']
keywords = ['berlin']
config = serpscrap.Config()
# only required if phantomjs binary could not detected
config.set('executable_path', '../phantomjs/phantomjs.exe')
config.set('num_workers', 1)
config.set('scrape_urls', False)
config.set('sel_browser', 'phantomjs')
scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.run()
for result in results:
    if 'serp_title' in result and len(result['serp_title']) > 1:
        print(result['serp_title'])
Using Chrome
------------
    pprint.pprint(result)
    print()
.. code-block:: bash
python examples\example_chrome.py
Simple Example - custom phantomjs path (deprecated)
---------------------------------------------------
It is possible to use Chrome, but we recommend PhantomJs, which is installed by default.
For using Chrome you need to download the latest `chromedriver`_ and to set the executable_path.
If phantomjs could not be installed, configure your
custom path to the binary.
.. code-block:: bash
.. code-block:: python
import pprint
import serpscrap
keywords = ['berlin']
keywords = ['seo trends', 'seo news', 'seo tools']
config = serpscrap.Config()
config.set('sel_browser', 'chrome')
config.set('chrome_headless', True)
config.set('executable_path', '/tmp/chromedriver_win32/chromedriver.exe')
# linux
# config.set('executable_path', '/usr/local/bin/chromedriver')
config.set('sel_browser', 'phantomjs')
# only required if phantomjs binary could not detected
config.set('executable_path', '../phantomjs/phantomjs.exe')
config.set('num_workers', 1)
config.set('scrape_urls', False)
scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.run()
for result in results:
    pprint.pprint(result)
    print()
    if 'serp_title' in result and len(result['serp_title']) > 1:
        print(result['serp_title'])
Image search
------------
@@ -137,11 +135,10 @@ In this example we scrape only a url, without crawling any search engine.
config = serpscrap.Config()
urlscrape = serpscrap.UrlScrape(config.get())
results = urlscrape.scrap_url(url)
result = urlscrape.scrap_url(url)
for result in results:
    print(result)
    print()
print(result)
print()
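
For reference, a self-contained sketch of this example (the url is a placeholder):

.. code-block:: python

    import serpscrap

    url = 'https://en.wikipedia.org/wiki/Berlin'  # placeholder url
    config = serpscrap.Config()
    urlscrape = serpscrap.UrlScrape(config.get())
    result = urlscrape.scrap_url(url)
    print(result)
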
Command Line
@@ -160,7 +157,7 @@ Example as_csv()
save the results for later SEO analytics by using the
as_csv() method. This method needs as its argument the path
to the file.
to the file. The saved file is tab-separated and values are quoted.
.. code-block:: python
Expand All @@ -173,7 +170,33 @@ to the file.
scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.as_csv('/tmp/seo-research')
scrap.as_csv('/tmp/seo-research')
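
Because the file is tab-separated and quoted, it can be read back with the standard csv module (the exact output filename below is an assumption):

.. code-block:: python

    import csv

    # assumption: as_csv('/tmp/seo-research') wrote a file like this
    with open('/tmp/seo-research.csv', newline='', encoding='utf-8') as f:
        reader = csv.reader(f, delimiter='\t', quotechar='"')
        for row in reader:
            print(row)
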
Example serp results and raw text of result urls
------------------------------------------------
You can scrape SERP results and fetch the raw text contents of the result urls in one run.
.. code-block:: bash
python examples\example_serp_urls.py
The resulting data will have additional fields containing data from the scraped urls.
.. code-block:: python
import serpscrap
keywords = ['blockchain']
config = serpscrap.Config()
config.set('scrape_urls', True)
scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
scrap.as_csv('/tmp/output')
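
To inspect which additional fields the url contents add, a hedged variant uses run() instead of as_csv() (the exact field names are not fixed here and depend on the scraped pages):

.. code-block:: python

    results = scrap.run()
    for result in results:
        # field names coming from scraped urls depend on the page
        print(sorted(result.keys()))
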
Example related
---------------
36 changes: 31 additions & 5 deletions docs/index.rst
@@ -32,7 +32,7 @@ Extract these result types
* results - standard search result
* shopping - shopping teaser within regular search results

For each result in a resultspage get
For each result of a resultspage get
====================================

* domain
@@ -47,8 +47,8 @@ For each result in a resultspage get

Also get a screenshot of each result page.
You can also scrape the text content of each result url.
It also possible to save the results as CSV for future analytics.
If required you can use your own proxylist.
It is also possible to save the results as CSV for future analytics.
If required you can also use your own proxylist.


Resources
@@ -88,11 +88,31 @@ SerpScrap in your applications
More details in the `examples`_ section of the documentation.

Supported OS
------------

* SerpScrap should work on Linux, Windows and Mac OS with Python >= 3.4 installed
* SerpScrap requires lxml
* It doesn't work on iOS

Changes
=======
Notes about major changes between releases

0.11.0
------

* Chrome headless is now the default browser; usage of PhantomJS is deprecated
* chromedriver is installed on the first run (tested on Linux and Windows; Mac OS should also work)
* the behavior of scraping raw text contents from SERP urls, and also from given urls, has changed
* scraping of SERP results and url contents can now run at once
* the CSV output format changed: it is now tab-separated and quoted

0.10.0
------

* support for headless chrome, adjusted default time between scrapes

0.9.0
-----

@@ -109,11 +129,17 @@ Notes about major changes between releases
References
==========

SerpScrap is using `PhantomJs`_ a scriptable headless WebKit, which is installed automaticly on the first run (Linux, Windows)
The scrapcore is based on `GoogleScraper`_ with several improvements.
SerpScrap uses `Chrome headless`_ and `lxml`_ to scrape SERP results. For the raw text contents of fetched urls it uses `beautifulsoup4`_.
SerpScrap also supports `PhantomJs`_ (deprecated), a scriptable headless WebKit, which is installed automatically on the first run (Linux, Windows).
The scrapcore was based on `GoogleScraper`_, an outdated project, and has many changes and improvements.

.. target-notes::

.. _`install`: http://serpscrap.readthedocs.io/en/latest/install.html
.. _`examples`: http://serpscrap.readthedocs.io/en/latest/examples.html
.. _`Chrome headless`: http://chromedriver.chromium.org/
.. _`lxml`: https://lxml.de/
.. _`beautifulsoup4`: https://www.crummy.com/software/BeautifulSoup/
.. _`PhantomJs`: https://github.com/ariya/phantomjs
.. _`GoogleScraper`: https://github.com/NikolaiT/GoogleScraper
34 changes: 25 additions & 9 deletions docs/install.rst
@@ -7,19 +7,34 @@ Install
pip uninstall SerpScrap -y
pip install SerpScrap --upgrade
On the first run SerpScrap will try to install the required PhantomJS binary on Windows and Linux instances.
If self install doesnt work you can configure your custom path to the phantomjs binary.
On the first run SerpScrap will try to install the required Chromedriver or PhantomJS binary on Windows and Linux instances.
If the self-install doesn't work, you can configure your custom path to the chromedriver or phantomjs binary.
For Linux, SerpScrap provides https://github.com/ecoron/SerpScrap/blob/master/install_chrome.sh, which should be executed automatically on the first run.
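
If you need to point SerpScrap at a custom binary, a minimal sketch (the path is an example, not a guaranteed default):

.. code-block:: python

    import serpscrap

    config = serpscrap.Config()
    # assumption: adjust this to wherever your chromedriver actually lives
    config.set('executable_path', '/usr/local/bin/chromedriver')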

Requirements Windows
--------------------
Chrome headless is recommended
------------------------------

for windows some dependencies are provided as binaries for python extension packages.
you can find them under: http://www.lfd.uci.edu/~gohlke/pythonlibs/
For your convenience here are the direct links:
By default SerpScrap uses headless Chrome.
You can also use PhantomJS, but it is deprecated and gets blocked by the search engine very quickly.
We recommend using headless Chrome.

lxml
----

lxml is required.

Windows
=======
For Windows you may need the lxml binary from here: http://www.lfd.uci.edu/~gohlke/pythonlibs/
For your convenience here are the direct links:
* `lxml`_

maybe you need also `Microsoft Visual C++ Build Tools`_ installed.
In some cases you may also need `Microsoft Visual C++ Build Tools`_ installed.

iOS
===
is not supported yet


cli encoding issues
-------------------
Expand All @@ -33,8 +48,9 @@ To avoid encode/decode issues use this command before you start using SerpScrap
References
==========

.. target-notes::

.. _`lxml`: http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
.. _`Microsoft Visual C++ Build Tools`: http://landinghub.visualstudio.com/visual-cpp-build-tools
.. _`lxml`: http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
