Command line program to download documents from web portals.

Overview

command line document download made easy



Highlights

  • list available documents in json format or download them
  • filter documents using
    • string matching
    • regular expressions or
    • jq queries
  • display captcha or QR codes for interactive input
  • writing new plugins is easy
  • existing plugins (some of them even work):
    • amazon
    • ing.de
    • dkb.de
    • o2.de
    • kabel.vodafone.de
    • conrad.de
    • elster.de



Dependencies



Installation

$ git clone --recursive https://github.com/heeplr/document-dl
$ cd document-dl
$ pip install .



Usage

Display Help:

$ document-dl -h
Usage: document-dl [OPTIONS] COMMAND [ARGS]...

  download documents from web portals

Options:
  -u, --username TEXT             login id  [env var: DOCDL_USERNAME]
  -p, --password TEXT             secret password  [env var: DOCDL_PASSWORD]
  -m, --match <ATTRIBUTE PATTERN>...
                                  only output documents where attribute
                                  contains pattern string  [env var:
                                  DOCDL_MATCH]

  -r, --regex <ATTRIBUTE REGEX>...
                                  only output documents where attribute value
                                  matches regex  [env var: DOCDL_REGEX]

  -j, --jq JQ_EXPRESSION          only output documents if json query matches
                                  document's attributes (see
                                  https://stedolan.github.io/jq/manual/ )
                                  [env var: DOCDL_JQ]

  -H, --headless BOOLEAN          show browser window if false  [env var:
                                  DOCDL_HEADLESS; default: True]

  -b, --browser [chrome|edge|firefox|ie|opera|safari|webkitgtk]
                                  webdriver to use for selenium based plugins
                                  [env var: DOCDL_BROWSER; default: chrome]

  -t, --timeout INTEGER           seconds to wait for data before terminating
                                  connection  [env var: DOCDL_TIMEOUT;
                                  default: 15]

  -i, --image-loading BOOLEAN     Turn off image loading when False  [env var:
                                  DOCDL_IMAGE_LOADING; default: False]

  -a, --action [download|list]    download or just list documents  [env var:
                                  DOCDL_ACTION; default: list]

  -h, --help                      Show this message and exit.

Commands:
  amazon    amazon.de (invoices)
  conrad    conrad.de (invoices)
  dkb       dkb.de with photoTAN (postbox)
  elster    elster.de with path to .pfx certfile as username (postbox)
  ing       banking.ing.de with photoTAN (postbox)
  o2        o2online.de (invoices/postbox)
  vodafone  kabel.vodafone.de (postbox, invoices)

Display plugin-specific help (a bug in click currently prompts for username and password before the help is displayed):

$ document-dl ing --help
Usage: document-dl ing [OPTIONS]

  banking.ing.de with photoTAN (postbox)

Options:
  -k, --diba-key TEXT  DiBa Key  [env var: DOCDL_DIBA_KEY]
  -h, --help           Show this message and exit.



Examples

List all documents from kabel.vodafone.de, prompt for username/password:

$ document-dl vodafone

Same, but show browser window this time:

$ document-dl --headless=false vodafone

Download all documents from conrad.de, pass credentials as commandline arguments:

$ document-dl --username mylogin --password mypass --action download conrad

Download all documents from conrad.de, pass credentials as env vars:

$ DOCDL_USERNAME='mylogin' DOCDL_PASSWORD='mypass' document-dl --action download conrad

Download all documents from o2online.de where "doctype" attribute contains "BILL":

$ document-dl --match doctype BILL --action download o2

You can also use regular expressions to filter documents:

$ document-dl --regex date '^(2021-04|2021-05).*$' o2

List all documents from o2online.de where year >= 2019:

$ document-dl --jq 'select(.year >= 2019)' o2

Download document from elster.de with id == 15:

$ document-dl --jq 'contains({id: 15})' --action download elster
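
The --match and --regex filters used above can be pictured with a small Python sketch. This is only an illustration of the described behaviour (substring test versus regular expression test on one attribute), not document-dl's actual implementation. The sample documents and the helper names match_filter and regex_filter are hypothetical, and using re.search rather than re.match is an assumption.

```python
import re

# Hypothetical sample data, shaped like the attribute dicts a plugin yields.
documents = [
    {"id": 0, "doctype": "BILL", "date": "2021-04-12"},
    {"id": 1, "doctype": "INFO", "date": "2021-05-03"},
    {"id": 2, "doctype": "BILL", "date": "2021-06-20"},
]


def match_filter(docs, attribute, pattern):
    """Keep documents whose attribute value contains the pattern string."""
    return [d for d in docs if pattern in str(d.get(attribute, ""))]


def regex_filter(docs, attribute, regex):
    """Keep documents whose attribute value matches the regular expression.

    Assumption: the regex is searched anywhere in the value (re.search).
    """
    compiled = re.compile(regex)
    return [d for d in docs if compiled.search(str(d.get(attribute, "")))]


# Roughly comparable to: --match doctype BILL
bills = match_filter(documents, "doctype", "BILL")
# Roughly comparable to: --regex date '^(2021-04|2021-05).*$'
spring = regex_filter(documents, "date", r"^(2021-04|2021-05).*$")

print([d["id"] for d in bills])   # [0, 2]
print([d["id"] for d in spring])  # [0, 1]
```

Attribute values are converted with str() before testing, so the same helpers would also work on non-string attributes such as numeric ids.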



Writing a plugin

Plugins are click-plugins, which in turn are normal @click.command functions registered in setup.py.

  • put your plugin into "docdl/plugins"

  • write your plugin class:

    • if you just need requests, inherit from docdl.WebPortal and use self.session, which is initialized for you
    • if you need selenium, inherit from docdl.SeleniumWebPortal and use self.webdriver, which is initialized for you

  • add click glue code

  • add your plugin to the docdl_plugins registry in setup.py

import click
import docdl
import docdl.util

class MyPlugin(docdl.WebPortal):

    URL_LOGIN = "https://myservice.com/login"

    def login(self):
        request = self.session.get(self.URL_LOGIN)
        # ... authenticate ...
        if not_logged_in:
            return False
        return True

    def logout(self):
        # ... logout ...
        pass

    def documents(self):
        # iterate over all available documents
        for count, document in enumerate(all_documents):

            # scrape:
            #  * document attributes
            #    * it's recommended to assign an incremental "id"
            #      attribute to every document
            #    * if you set a "filename" attribute, it will be used to
            #      rename the downloaded file
            #    * dates should be parsed to datetime.datetime objects
            #      docdl.util.parse_date() should parse the most common strings
            #
            # also you must scrape either:
            #  * the download URL
            #
            # or (for SeleniumWebPortal plugins):
            #  * the DOM element that triggers download. It is expected
            #    that the download starts immediately after click() on
            #    the DOM element
            # or implement a custom download() method

            yield docdl.Document(
                url=this_documents_url,
                # download_element=<some selenium element to click>
                attributes={
                    "id": count,
                    "category": "invoices",
                    "title": this_documents_title,
                    "filename": this_documents_target_filename,
                    "date": docdl.util.parse_date(some_date_string),
                },
            )


    def download(self, document):
        """you shouldn't need this for most web portals"""
        # ... save file to os.getcwd() as "filename" ...
        return self.rename_after_download(document, filename)


@click.command()
@click.pass_context
def myplugin(ctx):
    """plugin description (what, documents, are, scraped)"""
    docdl.cli.run(ctx, MyPlugin)

and in setup.py:

# ...
setup(
    # ...
    packages=find_packages(
        # ...
    ),
    entry_points={
        'docdl_plugins': [
            # ...
            'myplugin=docdl.plugins.myplugin:myplugin',
            # ...
        ],
        # ...
    }
)
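
To check that a plugin was registered, the entry point group can be inspected after running pip install. The following is a minimal sketch using the standard library's importlib.metadata. It is not necessarily how document-dl or click-plugins loads plugins, and the helper name list_plugins is hypothetical.

```python
from importlib.metadata import entry_points


def list_plugins(group="docdl_plugins"):
    """Return the sorted names of all entry points registered under group."""
    eps = entry_points()
    if hasattr(eps, "select"):
        # Python 3.10 and later: select entry points by group
        selected = eps.select(group=group)
    else:
        # Python 3.8/3.9: plain dict mapping group name to entry points
        selected = eps.get(group, [])
    return sorted(ep.name for ep in selected)


print(list_plugins())
```

If the registration above worked, "myplugin" should appear in the printed list. An empty list means no package in the current environment registers that group.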



Security

Beware that your login credentials will most likely be saved in your shell history if you pass them as command line arguments. To avoid that, use the input prompt or set environment variables safely.
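
One way to keep credentials out of the shell history when scripting is to prompt for them and hand them to document-dl through its documented environment variables. The following is a minimal sketch of that idea. The helper names build_env and run_document_dl are hypothetical, and it assumes document-dl is installed and on the PATH.

```python
import getpass
import os
import subprocess


def build_env(username, password, base_env=None):
    """Return a copy of the environment with the DOCDL credentials set."""
    env = dict(os.environ if base_env is None else base_env)
    env["DOCDL_USERNAME"] = username
    env["DOCDL_PASSWORD"] = password
    return env


def run_document_dl(plugin, action="list"):
    """Prompt for credentials, then run document-dl for the given plugin."""
    username = input("username: ")
    password = getpass.getpass("password: ")  # input is not echoed
    cmd = ["document-dl", "--action", action, plugin]
    return subprocess.run(cmd, env=build_env(username, password), check=False)


# Example call (prompts interactively):
# run_document_dl("conrad", action="download")
```

The credentials never appear on a command line here, so they end up neither in the shell history nor in the argument list of the process.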



Bugs

document-dl is still in a very early stage of development and a lot of things don't work yet. In particular, many edge cases still need to be covered. If you find a bug, please open an issue or send a pull request.

  • --browser settings other than chrome probably don't work unless you help to test them
  • some services offer more documents/data than currently scraped



TODO

  • logging
  • better documentation
  • properly parse rfc6266
  • delete action