Query
Overview
SpookyStuff’s’ Query Language is a short and powerful LINQ (Language-Integrated Query) that abstracts away the complexity of unstructured document parsing/traversing, link crawling and parallel computing/optimization, leaving only two things for you to consider: The data you want and the process to discover them. This leads to our first rule in syntax design: identical queries should expect identical results, and vice versa.
SpookyStuff is built on Scala, it borrows some concepts and most of its syntax from Spark SQL (e.g. inner & left join, explode, alias and select), notably its own Language-Integrated Query. This is for easier understanding and memorizing by SQL users, rather than draw an analogy between relational databases and linked/unstructured datasets. In fact, they have very different specifications: In linked data, join by expression and filter by predicate (“WHERE” clause in SQL) are prohibitively expensive if executed on the client side, which makes URI and web service-based lookup its only standard.
The following diagram illustrates the elements of SpookyStuff queries: context, clauses, actions and expressions.
To run it in any Scala environment, import all members of the package com.tribbloids.spookystuff.dsl:
Context
SpookyContext is the entry point of all queries, it can be constructed from a SparkContext or SQLContext, with SpookyConf as an optional parameter:
SpookyConf contains configuration options that are enumerated in Configuration Section, these won’t affect the result of the query, but have an impact on execution efficiency and resiliency. all these options can be set statically and changed dynamically, even in the middle of a query, e.g.
Multiple SpookyContext can co-exist and their configurations will only affect their derived queries respectively.
Clauses
Clauses are building blocks of a query, each denotes a specific pattern to discover new data from old data’s references (e.g. through hyperlinks). Clauses can be applied to any of the following Scala objects:
- PageRowRDD: this is the default data storage optimized for unstructured and linked data operations, like DataFrame optimized for relation data. Internally, PageRowRDD is a distributed and immutable row-storage that contains both unstructured “raw” documents and key-value pairs. it can be created from a compatible Spark dataset type by using SpookyContext.create():
And to those types by the following functions:
PageRowRDD is also the output of another clause so you can chain multiple clauses easily to form a long data valorization pipeline. In addition, it inherits SpookyContext from its source or predecessor, so settings in SpookyConf are persisted for all queries derived from it (unless you change them halfway, this is common if a query uses mutliple web services, each with its own Authentication credential or proxy requirement).
-
SpookyContext: equivalent to applying to an empty PageRowRDD with one row.
-
Any compatible Spark dataset type: equivalent to applying to a PageRowRDD initialized from SpookyContext.create(). To use this syntax you need to import context-aware implicit conversions:
The following 5 main clauses covered most data reference patterns in websites, documents and APIs, including, but not limited to, one-to-one, one-to-many, link graph and paginations.
fetch
Fetch is used to remotely fetch unstructured document(s) per row according to provided actions and load them into each row’s document buffer, which flush out previous document(s). Fetch can be used to retrieve an URI directly or combine data from different sources based on one-to-one relationship, e.g.
Where Wget
is a simple [action] that retrieves a document from a specified URI. Notice that the second URI is incomplete: part of it is denoted by '{_}'
, this is a shortcut for string interpolation: during execution, any part enclosed in '{key}'
will be replaced by a value in the same row the {key} identifies. String interpolation is part of a rich expression system covered in Expressions Section.
fetch parameters
Name | Default | Means |
---|---|---|
joinType | ||
flattenPagesPattern | ||
flattenPagesOrdinalKey | ||
numPartitions | ||
optimizer |
select & remove
Select is used to extract data from unstructured documents and persist into the key-value store, unlike SQL, select won’t discard existing data:
If the alias of an expression already exist in the store it will throw a QueryException and refuse to execute. In this case you can use ~! 'alias
to replace old value or ~+ alias
to append the value to existing ones as an array. To discard data from the key-value store, use remove:
flatten/explode
Flatten/Explode is used to transform each row into a sequence of rows, each has an object of the array selected from the original row:
flatten/explode parameters
Name | Default | Means |
---|---|---|
joinType |
SpookyStuff also provides flatSelect as a shorthand for flatten + select:
which is equivalent to:
One of its common usages is extracting multiple attributes and fields from DOM elements in a list or tree, e.g.
You may notice that the first parameter of flatSelect has no alias - SpookyStuff automatically assigns an alias ‘A to the flatten DOM elements as intermediate data, which also enables a few shorthand expressions for traversing deeper into them. This syntax will be used more often in join and explore.
join
Join is used to horizontally combine data from different sources based on one-to-many/many-to-many relationship, e.g. combining links on a search/index page with fulltext contents:
Join is a shorthand for flatten/explode + fetch, in which case the data/elements being flatten is similar to a foreign key in relational database.
join parameters
(See flatten/explode parameters and fetch parameters)
explore
Explore defines a parallel graph exploring pattern that can be best described as recursive join with deduplication: In each iteration, each row is joined with one or more documents, they are compared and merged with existing documents that has been traversed before, the iteration stops when maximum exploration depth has been reached, or no more new documents are found. Finally, all documents and their rows are concatenated vertically. Explore is used for conventional web crawling or to uncover “deep” connections in a graph of web resources, e.g. pagination, multi-tier taxonomy/disambiguation pages, “friend-of-a-friend” (FOAF) reference in a social/semantic network:
explore parameters
Name | Default | Means |
---|---|---|
joinType |
others
Actions
Each action defines a web client command, such as content download, loading a page in browser, or clicking a button.
Action can be chained to define a series of command in a client session, e.g:
In addition to the “chaining” operator +>, the following 2 can be used for branching:
-
Action(s) || Action(s)
: Combine multiple actions and/or chains in parallel, each executed in its own session and in parallel. This multiplies the cardinality of output with a fixed factor, e.g.
-
Action(s) *> Parallel-Actions(s)
: Cross/Cartesian join and concatenate a set of actions/chains with another set of actions/chains, resulting in all their combinations being executed in parallel. e.g.
The following tables list all out-of-the-box actions. parameters enclosed in square brackets are [optional] and are explained in Optional Parameter Section. If you are still not impressed and would like to write your own action, please refer to Writing Extension Section.
Web Client & System
Syntax | Means |
---|---|
Wget(URI-expression, [filter]) [~ ‘alias] | retrieve a document by its universal resource identifier (URI), whether its local or remote and the client protocol is determined by schema of the URI, currently, supported schema/protocals are http, https, ftp, file, hdfs, s3, s3n and all Hadoop-compatible file systems.* |
OAuthSign(Wget(…)) | sign the http/https request initiated by Wget by OAuth, using credential provided in SpookyContext, not effective on other protocols |
Delay([duration]) | hibernate the client session for a fixed duration |
RandomDelay([min, max]) | hibernate the client session for a random duration between {min} and {max} |
-
local resources (file, hdfs, s3, s3n and all Hadoop-compatible file systems) won’t be cached: their URIs are obviously not “universal”, plus fetching them directly is equally fast.
-
http and https clients are subjective to proxy option in SpookyContext, see Configuration Section for detail.
-
Wget is the most common action as well as the quickest. In many cases its the only action resolved in a fetch, join or explore, for which cases 3 shorthands clauses can be used:
Browser
The following actions are resolved against a browser launched at the beginning of the session. The exact type and specifications (screen resolution, image/javascript supports) of the browser being launched and its proxy setting are subjective to their respective options in SpookyContext. Launching the browser is optional and necessity driven: the session won’t launch anything if there is no need.
Syntax | Means |
---|---|
Visit(URI-expression, [delay, blocking]) | go to the website denoted by specified URI in the browser, usually performed by inserting the URI into the browser’s address bar and press enter |
Snapshot([filter]) [~ ‘alias] | export current document in the browser, the document is subjective to all DOM changes caused by preceding actions, exported documents will be loaded into the document buffer of the resulted PageRowRDD, if no SnapShot, Screenshot or Wget is defined in a chain, Snapshot will be automatically appended as the last action |
Screenshot([filter]) [~ ‘alias] | export screenshot of the browser as an image document in PNG format |
Assert([filter]) | Same as Snapshot, but this is only a dry-run which ensures that current document in the browser can pass the {filter}, nothing will be exported or cached |
Click(selector, [delay, blocking]) | perform 1 mouse click on the 1st visible element that qualify the jQuery {selector} |
ClickNext(selector, [delay, blocking]) | perform 1 mouse click on the 1st element that hasn’t been clicked before within the session, each session keeps track of its own history of interactions and elements involved |
Submit(selector, [delay, blocking]) | submit the form denoted by the 1st element that qualify the jQuery {selector}, this is usually performed by clicking, but can also be done by other interactions, like pressing enter on a text box |
TextInput(selector, text-expression, [delay, blocking]) | focus on the 1st element (usually but not necessarily a text box) that qualify the jQuery {selector}, then insert/paste the specified text |
DropDownSelect(selector, text-expression, [delay, blocking]) | focus on the 1st selectable list that qualify the jQuery {selector}, then select the item with specified value |
ExeScript(javascript-expression, [selector, delay, blocking]) | run javascript program against the 1st element that qualify the jQuery {selector}, or against the whole document if {selector} is set to null or missing |
DragSlider(selector, percentage, handleSelector, [delay, blocking]) | drag a slider handle denoted by {handleSelector} to a position defined by a specified percentage and an enclosing bounding box denoted by {selector} |
WaitFor(selector) | wait until the 1st element that qualify the jQuery {selector} appears |
- Visit +> Snapshot is the second most common action that does pretty much the same thing as Wget on HTML resources - except that it can render dynamic pages and javascript. Very few formats other than HTML/XML are supported by browsers without plug-in so its mandatory to fetch images and PDF files by Wget only. In addition, its much slower than Wget for obvious reason. If Visit +> Snapshot is the only chained action resolved in a fetch, join or explore, 3 shorthand clauses can be used:
- Multiple Snapshots and Screenshots can be defined in a chain of actions to persist different states of a browser session, their exported documents can be specifically referred by their aliases.
Flow Control
Syntax | Means |
---|---|
Loop(Action(s), [limit]) | Execute the specified {Action(s)} repeatedly until max iteration is reached or an exception is thrown during the execution (can be triggered by Assert) |
Try(Action(s), [retries, cacheError]) | By default, if an exception is thrown by an action, the entire session is retried, potentially by a different Spark executor on a different machine, this failover handling process continues until spark.task.maxFailures property has been reached, and the entire query failed fast. However, if the {Action(s)} throwing the exception is enclosed in the Try block, its last failure will be tolerated and only results in an error page being exported as a placeholder. Make sure to set spark.task.maxFailures to be larger than {retries}, or you query will fail fast before it even had a chance to interfere! |
If(condition, Action(s)-if-true, Action(s)-if-false) | perform a condition check on current document in the browser, then execute different <Action(s)> based on its result |
- Loop(ClickNext +> Snapshot) is generally used for turning pages when its pagination is implemented in AJAX rather than direct hyperlinks, in which case a shorthand is available:
Optional Parameters
Name | Default | Means |
---|---|---|
filter | ExportFilters.MustHaveTitle | |
alias | Action.toString | |
timeout | null | |
delay | 1.second | |
blocking | true | |
limit | 500 | |
retries | 3 | |
cacheError | false |
Expressions
In previous examples, you may already see some short expressions being used as clause or action parameters, like the string interpolation in URI templating, or the “Big S” / “Big A” notations in many select clauses for unstructured information refinement. These are parts of a much more powerful (and equally short) expression system. It allows complex data reference to be defined in a few words and operators (words! not lines), while other frameworks may take scores of line. E.g. the following query uses an URI expression (in the second Wget), which pipes all titles on Google News RSS feed into a single MyMemory translation API call:
Internally, each expression is a function from a single row in PageRowRDD to a nullable data container. Please refer to Writing Extension Section if you want to use customized expressions.
The following is a list of basic symbols and functions/operators that are defined explicitly as shorthands:
symbols
Scala symbols (identifier preceded by tick/’) has 2 meanings: it either returns a key-value pair by its key, or a subset of documents in the unstructured document buffer filtered by their aliases. SpookyStuff’s execution engine is smart enough not to throw an error unless there is a conselectorflict - In which case you need to use different names for data and documents.
The following 2 variables are also treated as symbols with special meaning:
-
S
: Returns the only document in the unstructured document buffer, error out if multiple documents exist. -
S_*
: Returns all documents in the unstructured document buffer.
string interpolation
You have seen the [basic form] of string interpolation early in some examples, which inserts only key-value pair(s) to a string template. The [above example] demonstrates a more powerful string interpolation, which inserts all kinds of expressions:
Any non-expression, non-string typed identifier is treated as literal.
unstructured document traversing
These are operators that does the most important things: traversing DOM of unstructured documents to get data you want (potentially structured or semi-structured).
SpookyStuff supports a wide range of documents including HTML, XHTML, XML, JSON, and all other formats supported by Apache Tika (PDF, Microsoft Office and much more). In fact, most of the following operators are format-agnostic: they use different parsers on different formats to get the same DOM tree. Not all these formats have equivalent DOM representation (e.g. JSON doesn’t have annotated text), so using a few operators on some formats doesn’t make sense and always returns null value.
The exact format of a document is dictated by its mime-type indicated by the web resource, if it is not available, SpookyStuff will auto-detect it by analysing its extension name and binary content.
All operators that fits into this category are listed in the following table, with their compatible formats and explanation.
Operator | Formats | Means |
.findAll(selector) | All | returns all elements that qualify the selector provided, which should be exact field name for JSON type and jQuery selector for all others |
\ selector | All | Same as above |
.findFirst(selector) | All | first element returned by findAll |
.children(selector”) | All | returns only the direct children that qualify by the selector provided, which should be exact field name for JSON type and jQuery selector for all others |
\ selector | All | Same as above |
.child(selector) | All | first element returned by children |
S”selector” | All | Big S selector: equivalent to S.findAll(selector) |
A”selector” | All | Big A selector: equivalent to ‘A.findAll(selector), ‘A is the default symbol for the elements/data pivoted in flatten, flatSelect, join or explore |
S_*“selector” | All | Equivalent to S_*.findAll(selector) |
.uri | All | URI of the document or DOM element, this may be different from the URI specified in Wget or Visit due to redirection(s) |
.code | All | returns the original HTML/XML/JSON code of the parsed document/DOM element as string |
.text | All | returns raw text of the parsed document/DOM element stripped of markups, on JSON this strips all field names and retains only their values |
.ownText | All | returns raw text of the parsed document/DOM element excluding those of its children, on JSON this returns null if the element is an object |
.attr(“attribute-name”, [nullable?]) | All | returns an attribure value of the parsed document/DOM element, on JSON this returns a property preceded by “@”, parameter {nullable?} decides whether a non-existing attribute should be returned as null or an empty string |
.href | All | same as .attr(“href”) on HTML/XML, same as .ownText on JSON |
.src | All | same as .attr(“src”) on HTML/XML, same as .ownText on JSON |
.boilerpipe | Non-JSON | use boilerpipe to extract the main textual content of a HTML file, this won’t work for JSON |
others
Many other functions are also supported by the expression system but they are too many to be listed here. Scala users are recommended to refer to source code and scaladoc of com.tribbloids.spookystuff.dsl
package, as these functions are built to resemble Scala functions and operators, rather than the less capable SQL LINQ syntax. In fact, SpookyStuff even use Scala reflective programming API to handle functions it doesn’t know.
Profiling
SpookyStuff has a metric system based on Spark’s Accumulator, it can be accessed from metrics property under the SpookyContext:
By default each query keep track of its own metric, if you would like to have all metrics of queries from the same SpookyContext to be aggregated, simply set conf.shareMetrics property of the SpookyContext to true.
Examples
Interactive examples are maintained on tribbloid® product page
More examples can be found under spookystuff-example package