Merge pull request #39 from ecoron/0.11.0
0.11.0
Ronald Schmidt authored Aug 26, 2018
2 parents 610fd72 + faa250b commit 5f943ec
Showing 23 changed files with 296 additions and 101 deletions.
3 changes: 2 additions & 1 deletion .gitignore
@@ -89,4 +89,5 @@ ENV/
.settings

make.bat
phantomjs/
phantomjs/
chromedriver/
3 changes: 1 addition & 2 deletions .travis.yml
@@ -8,8 +8,7 @@ sudo: required


install:
- sh install_chrome.sh
# - pip install -r requirements.txt
# - sh install_chrome.sh
- python setup.py -q install
# command to run tests
script: pytest
29 changes: 24 additions & 5 deletions README.rst
@@ -29,7 +29,7 @@ Extract these result types
* results - standard search result
* shopping - shopping teaser within regular search results

For each result in a resultspage get
For each result of a resultspage get
====================================

* domain
@@ -44,8 +44,8 @@ For each result in a resultspage get

Also get a screenshot of each result page.
You can also scrape the text content of each result url.
It also possible to save the results as CSV for future analytics.
If required you can use your own proxylist.
It is also possible to save the results as CSV for future analytics.
If required you can also use your own proxylist.
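
A minimal sketch of this workflow (the keyword and output handling are placeholders; the calls follow the `examples`_ referenced below):

.. code-block:: python

    import serpscrap

    keywords = ['example keyword']
    config = serpscrap.Config()
    scrap = serpscrap.SerpScrap()
    scrap.init(config=config.get(), keywords=keywords)
    results = scrap.run()

    for result in results:
        print(result)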


Resources
@@ -106,11 +106,26 @@ To avoid encode/decode issues use this command before you start using SerpScrap
.. image:: https://raw.githubusercontent.com/ecoron/SerpScrap/master/docs/logo.png
:target: https://github.com/ecoron/SerpScrap

Supported OS
------------

* SerpScrap should work on Linux, Windows and Mac OS with Python >= 3.4 installed
* SerpScrap requires lxml
* It doesn't work on iOS

Changes
-------
Notes about major changes between releases

0.11.0
======

* Chrome headless is now the default browser; usage of PhantomJS is deprecated
* chromedriver is installed on the first run (tested on Linux and Windows; Mac OS should also work)
* the behavior of scraping raw text contents from SERP urls, and also from given urls, has changed
* scraping of SERP results and url contents can now run at once
* the CSV output format changed: it is now tab-separated and quoted

0.10.0
======

@@ -132,13 +147,17 @@ Notes about major changes between releases
References
----------

SerpScrap is using `PhantomJs`_ a scriptable headless WebKit, which is installed automaticly on the first run (Linux, Windows).
The scrapcore is based on `GoogleScraper`_ with several improvements.
SerpScrap uses `Chrome headless`_ and `lxml`_ to scrape SERP results. For the raw text contents of fetched urls it uses `beautifulsoup4`_.
SerpScrap also supports `PhantomJs`_ (deprecated), a scriptable headless WebKit, which is installed automatically on the first run (Linux, Windows).
The scrapcore was based on `GoogleScraper`_, an outdated project, and has many changes and improvements.

.. target-notes::

.. _`install`: http://serpscrap.readthedocs.io/en/latest/install.html
.. _`examples`: http://serpscrap.readthedocs.io/en/latest/examples.html
.. _`Chrome headless`: http://chromedriver.chromium.org/
.. _`lxml`: https://lxml.de/
.. _`beautifulsoup4`: https://www.crummy.com/software/BeautifulSoup/
.. _`PhantomJs`: https://github.com/ariya/phantomjs
.. _`GoogleScraper`: https://github.com/NikolaiT/GoogleScraper

4 changes: 2 additions & 2 deletions docs/conf.py
@@ -58,9 +58,9 @@
# built documents.
#
# The short X.Y version.
version = '0.10'
version = '0.11'
# The full version, including alpha/beta/rc tags.
release = '0.10.4'
release = '0.11.0'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
5 changes: 3 additions & 2 deletions docs/configuration.rst
@@ -21,12 +21,13 @@ Default configuration
* database_name: '/tmp/serpscrap' - path and name of the sqlite db (stores scrape results)
* dir_screenshot: '/tmp/screenshots' - basedir for saved screenshots
* do_caching: True - enable / disable caching
* executable_path: '/usr/local/bin/chromedriver' - path to chromedriver
* executable_path: '/usr/local/bin/chromedriver' - path to chromedriver, should be detected automatically
* google_search_url: 'https://www.google.com/search?' - base search url, modify for other countries
* headers: - dict to customize request header, see below
* num_pages_for_keyword: 2 - number of result pages to scrape
* num_results_per_page: 10 - number of results per search engine page
* proxy_file: '' - path to proxy file, see below
* sel_browser: 'chrome' - browser (chrome, phantomjs)
* scrape_urls: False - scrape urls of search results
* screenshot: True - enable screenshots for each query
* search_engines: ['google'] - search engines (google)
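
As a sketch, any of these defaults can be overridden with config.set() before scraping (the values below are illustrative, not recommendations):

.. code-block:: python

    import serpscrap

    config = serpscrap.Config()
    config.set('num_pages_for_keyword', 1)  # scrape only the first result page
    config.set('screenshot', False)         # skip screenshots
    config.set('do_caching', False)         # disable caching for this run
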
@@ -80,7 +81,7 @@ for config keys that are not provided the default values still exist
Headers
-------

You can customize your searchengine request headers
You can customize your search engine request headers if you are using PhantomJS
by providing a dict in your configuration. If you
don't customize this setting, the default is used.

95 changes: 59 additions & 36 deletions docs/examples.rst
@@ -34,62 +34,60 @@ You can disable url scraping by setting the config value scrape_urls to False.
for result in results:
    print(result)
Simple Example - custom phantomjs path
--------------------------------------
Simple example using phantomjs (deprecated)
-------------------------------------------

If phantomjs could not be installed, configure your
custom path to the binary.
.. code-block:: bash
.. code-block:: python
python examples\example_phantomjs.py
It is possible to use PhantomJS, but we recommend Chrome. Depending on your choice, an automatic install of the required binary will be attempted.
For using Chrome you need the latest `chromedriver`_ and have to set the executable_path.

.. code-block:: bash
import pprint
import serpscrap
keywords = ['seo trends', 'seo news', 'seo tools']
keywords = ['berlin']
config = serpscrap.Config()
# only required if phantomjs binary could not detected
config.set('executable_path', '../phantomjs/phantomjs.exe')
config.set('num_workers', 1)
config.set('scrape_urls', False)
config.set('sel_browser', 'phantomjs')
scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.run()
for result in results:
    if 'serp_title' in result and len(result['serp_title']) > 1:
        print(result['serp_title'])
Using Chrome
------------
    pprint.pprint(result)
    print()
.. code-block:: bash
python examples\example_chrome.py
Simple Example - custom phantomjs path (deprecated)
---------------------------------------------------
It is possible to use Chrome, but we recommend PhantomJs, which is installed by default.
For using Chrome you need to download the latest `chromedriver`_ and to set the executable_path.
If phantomjs could not be installed, configure your
custom path to the binary.
.. code-block:: bash
.. code-block:: python
import pprint
import serpscrap
keywords = ['berlin']
keywords = ['seo trends', 'seo news', 'seo tools']
config = serpscrap.Config()
config.set('sel_browser', 'chrome')
config.set('chrome_headless', True)
config.set('executable_path', '/tmp/chromedriver_win32/chromedriver.exe')
# linux
# config.set('executable_path', '/usr/local/bin/chromedriver')
config.set('sel_browser', 'phantomjs')
# only required if phantomjs binary could not detected
config.set('executable_path', '../phantomjs/phantomjs.exe')
config.set('num_workers', 1)
config.set('scrape_urls', False)
scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.run()
for result in results:
    pprint.pprint(result)
    print()
    if 'serp_title' in result and len(result['serp_title']) > 1:
        print(result['serp_title'])
Image search
------------
@@ -137,11 +135,10 @@ In this example we scrape only a url, without crawling any search engine.
config = serpscrap.Config()
urlscrape = serpscrap.UrlScrape(config.get())
results = urlscrape.scrap_url(url)
result = urlscrape.scrap_url(url)
for result in results:
    print(result)
    print()
print(result)
print()
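
For reference, a self-contained sketch of this example (the url is a placeholder):

.. code-block:: python

    import serpscrap

    url = 'https://en.wikipedia.org/wiki/Berlin'  # placeholder url
    config = serpscrap.Config()
    urlscrape = serpscrap.UrlScrape(config.get())
    result = urlscrape.scrap_url(url)
    print(result)
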
Command Line
@@ -160,7 +157,7 @@ Example as_csv()
save the results for later SEO analytics by using the
as_csv() method. This method needs as its argument the path
to the file.
to the file. The saved file is tab-separated and values are quoted.
.. code-block:: python
Expand All @@ -173,7 +170,33 @@ to the file.
scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
results = scrap.as_csv('/tmp/seo-research')
scrap.as_csv('/tmp/seo-research')
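
Because the file is tab-separated and quoted, it can be read back with the standard csv module (the exact output filename below is an assumption):

.. code-block:: python

    import csv

    # assumption: as_csv('/tmp/seo-research') wrote a file like this
    with open('/tmp/seo-research.csv', newline='', encoding='utf-8') as f:
        reader = csv.reader(f, delimiter='\t', quotechar='"')
        for row in reader:
            print(row)
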
Example serp results and raw text of result urls
------------------------------------------------
You can scrape SERP results and fetch the raw text contents of the result urls in one run.
.. code-block:: bash
python examples\example_serp_urls.py
The resulting data will have additional fields containing data from the scraped urls.
.. code-block:: python
import serpscrap
keywords = ['blockchain']
config = serpscrap.Config()
config.set('scrape_urls', True)
scrap = serpscrap.SerpScrap()
scrap.init(config=config.get(), keywords=keywords)
scrap.as_csv('/tmp/output')
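
To inspect which additional fields the url contents add, a hedged variant uses run() instead of as_csv() (the exact field names are not fixed here and depend on the scraped pages):

.. code-block:: python

    results = scrap.run()
    for result in results:
        # field names coming from scraped urls depend on the page
        print(sorted(result.keys()))
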
Example related
---------------
36 changes: 31 additions & 5 deletions docs/index.rst
@@ -32,7 +32,7 @@ Extract these result types
* results - standard search result
* shopping - shopping teaser within regular search results

For each result in a resultspage get
For each result of a resultspage get
====================================

* domain
@@ -47,8 +47,8 @@ For each result in a resultspage get

Also get a screenshot of each result page.
You can also scrape the text content of each result url.
It also possible to save the results as CSV for future analytics.
If required you can use your own proxylist.
It is also possible to save the results as CSV for future analytics.
If required you can also use your own proxylist.


Resources
@@ -88,11 +88,31 @@ SerpScrap in your applications
More details in the `examples`_ section of the documentation.

Supported OS
------------

* SerpScrap should work on Linux, Windows and Mac OS with Python >= 3.4 installed
* SerpScrap requires lxml
* It doesn't work on iOS

Changes
=======
Notes about major changes between releases

0.11.0
------

* Chrome headless is now the default browser; usage of PhantomJS is deprecated
* chromedriver is installed on the first run (tested on Linux and Windows; Mac OS should also work)
* the behavior of scraping raw text contents from SERP urls, and also from given urls, has changed
* scraping of SERP results and url contents can now run at once
* the CSV output format changed: it is now tab-separated and quoted

0.10.0
------

* support for headless chrome, adjusted default time between scrapes

0.9.0
-----

@@ -109,11 +129,17 @@ Notes about major changes between releases
References
==========

SerpScrap is using `PhantomJs`_ a scriptable headless WebKit, which is installed automaticly on the first run (Linux, Windows)
The scrapcore is based on `GoogleScraper`_ with several improvements.
SerpScrap uses `Chrome headless`_ and `lxml`_ to scrape SERP results. For the raw text contents of fetched urls it uses `beautifulsoup4`_.
SerpScrap also supports `PhantomJs`_ (deprecated), a scriptable headless WebKit, which is installed automatically on the first run (Linux, Windows).
The scrapcore was based on `GoogleScraper`_, an outdated project, and has many changes and improvements.

.. target-notes::

.. _`install`: http://serpscrap.readthedocs.io/en/latest/install.html
.. _`examples`: http://serpscrap.readthedocs.io/en/latest/examples.html
.. _`Chrome headless`: http://chromedriver.chromium.org/
.. _`lxml`: https://lxml.de/
.. _`beautifulsoup4`: https://www.crummy.com/software/BeautifulSoup/
.. _`PhantomJs`: https://github.com/ariya/phantomjs
.. _`GoogleScraper`: https://github.com/NikolaiT/GoogleScraper
34 changes: 25 additions & 9 deletions docs/install.rst
@@ -7,19 +7,34 @@ Install
pip uninstall SerpScrap -y
pip install SerpScrap --upgrade
On the first run SerpScrap will try to install the required PhantomJS binary on Windows and Linux instances.
If self install doesnt work you can configure your custom path to the phantomjs binary.
On the first run SerpScrap will try to install the required Chromedriver or PhantomJS binary on Windows and Linux instances.
If the self-install doesn't work, you can configure your custom path to the chromedriver or phantomjs binary.
For Linux, SerpScrap provides https://github.com/ecoron/SerpScrap/blob/master/install_chrome.sh, which should be executed automatically on the first run.
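
If you need to point SerpScrap at a custom binary, a minimal sketch (the path is an example, not a guaranteed default):

.. code-block:: python

    import serpscrap

    config = serpscrap.Config()
    # assumption: adjust this to wherever your chromedriver actually lives
    config.set('executable_path', '/usr/local/bin/chromedriver')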

Requirements Windows
--------------------
Chrome headless is recommended
------------------------------

for windows some dependencies are provided as binaries for python extension packages.
you can find them under: http://www.lfd.uci.edu/~gohlke/pythonlibs/
For your convenience here are the direct links:
By default SerpScrap uses headless Chrome.
You can also use PhantomJS, but it is deprecated and gets blocked by the search engine very quickly.
We recommend using headless Chrome.

lxml
----

lxml is required.

Windows
=======
For Windows you may need the lxml binary from here: http://www.lfd.uci.edu/~gohlke/pythonlibs/
For your convenience here are the direct links:
* `lxml`_

maybe you need also `Microsoft Visual C++ Build Tools`_ installed.
In some cases you may also need `Microsoft Visual C++ Build Tools`_ installed.

iOS
===
is not supported yet


cli encoding issues
-------------------
Expand All @@ -33,8 +48,9 @@ To avoid encode/decode issues use this command before you start using SerpScrap
References
==========

.. target-notes::

.. _`lxml`: http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
.. _`Microsoft Visual C++ Build Tools`: http://landinghub.visualstudio.com/visual-cpp-build-tools
.. _`lxml`: http://www.lfd.uci.edu/~gohlke/pythonlibs/#lxml
