
Cheat Sheet

Peng Cheng edited this page Jan 5, 2015 · 2 revisions

The current implementation supports only a Language Integrated Query (LINQ) style API. The APIs are not finalized and may change at any time. Support for SQL is on the roadmap but may be abandoned in favour of simplicity.

Each query is a combination of 3 parts: Context, Action Plan and Extraction.

Context represents the input and output data of a scraping job in key-value format. Contexts are always created as strings or key-value pairs and are carried by other entities as metadata throughout a query's lifespan.

A Context can be created with any Spark parallelization or transformation (though this is rarely used), e.g.:

- sc.parallelize(Seq("Metallica","Megadeth","Slayer","Anthrax"))

- sc.parallelize(Seq(Map("first name"->"Taylor","last name"->"Swift"), Map("first name"->"Avril","last name"->"Lavigne")))

- sc.parallelize(Seq("Taylor\tSwift", "Avril\tLavigne")).csvToMap("first name\tlast name", "\t")

- sc.textFile("products.txt")

- noInput (this creates a query entry point with no context)
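To illustrate the key-value shape a Context takes, here is a minimal sketch in plain Scala of how a delimited row could be paired with a header row, in the spirit of the `csvToMap` example above. `ContextSketch` and `rowToContext` are hypothetical names invented for this illustration. They are not part of the library, and the library's actual behaviour may differ.

```scala
import java.util.regex.Pattern

object ContextSketch {
  // Hypothetical helper (not a library API): pairs each header name with the
  // value at the same position in a delimited row.
  def rowToContext(row: String, headerRow: String, delimiter: String): Map[String, String] = {
    // Quote the delimiter so it is treated literally rather than as a regex.
    val literal = Pattern.quote(delimiter)
    // A limit of -1 keeps trailing empty fields instead of dropping them.
    val keys = headerRow.split(literal, -1)
    val values = row.split(literal, -1)
    // zip stops at the shorter of the two arrays, so extra fields are ignored.
    keys.zip(values).toMap
  }
}
```

For example, `ContextSketch.rowToContext("Taylor\tSwift", "first name\tlast name", "\t")` yields a map with the keys `first name` and `last name`.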

Action Plan always has the following format:

(Context +> Action1 +> Action2 +> ... +> ActionN !)

These are the same actions a human would perform to reach the data page. They are executed in the order in which they are defined.
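The ordering guarantee can be pictured with a toy model in plain Scala. This is only a sketch of the chaining style: `ActionPlanSketch`, `Step`, `Plan` and `run` are hypothetical names, the `+>` defined here is a stand-in written for this example, and nothing below reflects how the library implements Action Plans.

```scala
object ActionPlanSketch {
  // Toy stand-in for an action; real actions drive a browser or client.
  final case class Step(name: String)

  // Toy plan: a context plus the steps queued so far, in definition order.
  final case class Plan(context: Map[String, String], steps: Vector[Step] = Vector.empty) {
    // Appends a step to the end of the plan, imitating the `+>` chaining style.
    def +>(step: Step): Plan = copy(steps = steps :+ step)

    // Returns the step names in the order they would be executed.
    def run: Vector[String] = steps.map(_.name)
  }
}
```

Chaining `Plan(ctx) +> Step("visit") +> Step("type") +> Step("click")` and calling `run` returns the names in exactly that order, because `+>` is left-associative and each call appends to the end.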

There are three types of Actions:

  • Export: Export a page from a browser or client. The page can be any web resource, including an HTML/XML file, image, PDF file or JSON string.

  • Interactive: Interact with the browser (e.g. click a button or type into a search box) to reach the data page. All interactive actions executed before a page is exported are logged in that page's backtrace.

  • Container: Only for complex workflow control. Each defines a nested/non-linear subroutine that may be skipped, executed once, or executed multiple times depending on the situation.

Many Actions support Context Interpolation: you can embed context references in their constructors by inserting the context's keys enclosed in #{}, which are automatically replaced at runtime with the values they map to. This is used almost exclusively for typing into a textbox.
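The substitution itself can be sketched in a few lines of plain Scala. This is an illustration of the idea only: `InterpolationSketch` and `interpolate` are hypothetical names, and leaving unknown keys untouched is an assumption made for this example rather than documented library behaviour.

```scala
import scala.util.matching.Regex

object InterpolationSketch {
  // Matches #{key} and captures the key between the braces.
  private val placeholder: Regex = """#\{([^}]*)\}""".r

  // Hypothetical helper (not a library API): replaces each #{key} with the
  // value the context maps it to. Assumption: a key missing from the context
  // is left in place unchanged.
  def interpolate(template: String, context: Map[String, String]): String =
    placeholder.replaceAllIn(template, m =>
      // quoteReplacement keeps '$' and '\' in values from being read as group references.
      Regex.quoteReplacement(context.getOrElse(m.group(1), m.matched)))
}
```

With the context `Map("first name" -> "Taylor", "last name" -> "Swift")`, the template `#{first name} #{last name}` becomes `Taylor Swift`.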

For more information on Actions and Action Plan usage, please refer to the scaladoc of ClientAction.scala and ActionPlanRDDFunction.scala respectively.

Extraction defines a transformation from Pages (including the immediate pages from Action Plans and the pages they link to; see join/left-join) to relational data output. This is often the goal and last step of data collection, but not always. There is no constraint on the relative order of Action Plans and Extractions: you can reuse extraction results as context to get more data from a different site, or feed them into another data flow implemented with other Apache Spark components, provided you are familiar with them.

Functions in Extraction have four types:

  • extract: Extract data from Pages by using data's enclosing elements' HTML/XML/JSON selector(s) and attribute(s) as anchor points.

  • save/dump: Save all pages into a file system (HDD/S3/HDFS).

  • select: Extract data from Pages and insert it into the pages' respective contexts as metadata.

  • join: This is similar to the notion of a join in relational databases, except that links between pages are used as the foreign keys between tables. (Technically, not just links but anything that implies a connection between web resources, including frames, iframes, sources and redirections.)
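The join analogy can be made concrete with a toy model in plain Scala, where a page's links play the role of foreign keys into a table of fetched pages. Everything here is an assumption made for illustration: `JoinSketch`, its `Page` case class and its `join` function are hypothetical and are unrelated to the library's own Page type. The sketch also shows only inner-join behaviour, where links with no matching page are dropped.

```scala
object JoinSketch {
  // Toy page: its URL, the URLs it links to, and the context it carries.
  final case class Page(url: String, links: Seq[String], context: Map[String, String])

  // Inner join: each link on a parent page is looked up in the fetched pages,
  // and every match is returned with the parent's context merged into it.
  def join(parents: Seq[Page], fetched: Map[String, Page]): Seq[Page] =
    for {
      parent <- parents
      link   <- parent.links
      child  <- fetched.get(link).toSeq // unresolved links yield nothing
    } yield child.copy(context = parent.context ++ child.context)
}
```

A left-join variant would instead keep a row for each parent whose links resolve to nothing, much as a relational left join keeps unmatched rows from the left table.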

For more information on Extraction syntax, please refer to the scaladoc of Page.scala and PageRDDFunction.scala.
