Deploying
Installation
Why so complex? All you need is a single library in your Java/Scala classpath:
groupId: com.tribbloids.spookystuff
artifactId: spookystuff-assembly_2.10
version: 0.3.2
You have 3 options: download it, build it or let a dependency manager (Apache Maven, sbt, Gradle etc.) do it for you.
Direct Download
 | Stable (0.3.2) | Nightly (0.3.3-SNAPSHOT) |
---|---|---|
Library | Download .jar | Download .jar |
Bundled with Spark 1.5.1 | Download .zip | Download .zip |
Bundled with Spark 1.4.1 | Download .zip | Download .zip |
Bundled with Spark 1.3.1 | Download .zip | Download .zip |
This pre-built JAR/bundle provides full functionality out-of-the-box; however, you need an Apache Spark installation first (an integrated Spark environment, e.g. Notebooks in databricks™ Cloud or Apache Zeppelin, also counts). If you haven’t done so, please refer to the Apache Spark installation Guide or the Integration Section.
As a Dependency
If you want to use SpookyStuff as a library in your source code, the easiest way is to let your dependency manager (e.g. Apache Maven, sbt, Gradle) download it automatically from the Maven Central Repository, by adding the artifact reference shown at the top of this section to your build definition.
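For example, with sbt the reference translates into a single line in your build definition (a sketch; substitute the release version you actually want):

```scala
// build.sbt — coordinates taken from the artifact reference at the top of this section
libraryDependencies += "com.tribbloids.spookystuff" % "spookystuff-assembly_2.10" % "0.3.2"
```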
Many integrated Spark environments (e.g. Spark-Shell, databricks™ Cloud or Apache Zeppelin) have a built-in dependency manager, which makes deployment much easier by eliminating the need for manual downloads. This is again covered in the Integration Section.
Optional Components
SpookyStuff natively supports 2 headless browsers for deep web access: HtmlUnit and the much more capable PhantomJS. The binary executable of PhantomJS is NOT included in the pre-built JAR; if it is not detected on a Spark worker, it will be automatically downloaded from a PhantomJS mirror. To avoid repeated downloads, you can install a permanent copy on each Spark worker.
Then tell SpookyStuff to use it by setting the environment variable PHANTOMJS_PATH or the system property phantomjs.binary.path to the full installation path (you don’t need to set this on the Spark workers themselves). SpookyStuff will then use the local copies instead of downloading. Note that the PhantomJS installation path must be identical on all workers; otherwise no single setting can point SpookyStuff at all of them.
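For example, you could point SpookyStuff at an existing binary like this (the installation path below is hypothetical; use wherever PhantomJS actually lives on your workers):

```scala
// Either export PHANTOMJS_PATH=/opt/phantomjs/bin/phantomjs before launching the driver,
// or set the equivalent system property programmatically:
System.setProperty("phantomjs.binary.path", "/opt/phantomjs/bin/phantomjs")
```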
Another optional component for better deep web compatibility is a Tor client, which SpookyStuff can use as a SOCKS5 proxy (both HTTP and SOCKS proxies are supported by SpookyStuff). Installation of a Tor client is OS-specific; users are advised to follow this instruction to install and launch a Tor client on each of their Spark workers.
Configuration
The following options can be set independently for each SpookyContext (by changing its conf.{option-name} property in your program), or set collectively through environment variables and/or Spark/system properties. If multiple values for an option are set by different methods (e.g. the environment variable SPOOKY_CACHE and the Spark property spooky.cache are set to different values), the value set by the method that appears earlier in the following list overrides the others:
In-Program > Spark Property > System Property > Environment Variable > Default
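As a sketch of the in-program method (the exact SpookyContext constructor may vary between versions, but options follow the conf.{option-name} convention described above):

```scala
import scala.concurrent.duration._
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext
import com.tribbloids.spookystuff.SpookyContext

val sc = new SparkContext(new SparkConf().setAppName("spooky-config-demo"))
val spooky = new SpookyContext(new SQLContext(sc))

// In-program values take precedence over Spark properties, system properties
// and environment variables:
spooky.conf.cacheRead = true          // restore pages from the web cache when possible
spooky.conf.pageExpireAfter = 1.day   // treat cached pages older than one day as expired
```

The same spooky context is reused in the configuration sketches below.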
Web Caching
Name | System Property | Environment Variable | Default | Meaning |
---|---|---|---|---|
cacheWrite | N/A | N/A | true | Whether fetched web resources should be written to the web cache. Local resources (from URIs starting with file://, hdfs:// or s3://) won’t be cached |
cacheRead | N/A | N/A | true | Whether SpookyStuff should first attempt to restore web resources from the web cache (at a much lower cost) instead of fetching them remotely |
pageExpireAfter | N/A | N/A | 7.days | Cached resources older than this duration are deemed ‘expired’ and won’t be restored |
pageNotExpiredSince | N/A | N/A | None | Cached resources with a timestamp older than this are deemed ‘expired’ and won’t be restored |
dirs.cache | spooky.dirs.cache | N/A | {dirs.root}/cache | URI of the directory for web caching, the URI format should be like scheme://authority/path (read this article for details). Users may want to set this to a unified location like an HDFS directory so all Spark workers and applications can reuse cached resources |
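For example, to make every worker and application reuse one cache (the HDFS path is hypothetical, and the nested dirs field follows the conf.{option-name} naming above, so field names may differ slightly between versions):

```scala
// Point the web cache at a shared HDFS directory so cached pages are reused cluster-wide:
spooky.conf.dirs.cache = "hdfs://namenode:8020/spooky/cache"
```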
Query Optimization
Name | System Property | Environment Variable | Default | Meaning |
---|---|---|---|---|
defaultQueryOptimizer | N/A | N/A | Wide | Set the default Query Optimizer to one of 3 options: Narrow, Wide or Wide_RDDWebCache. The Query Optimizer affects how duplicate & unnecessary remote accesses (e.g. crawling a directory with diamond links) are handled before being distributed and executed. The meaning of each option is described in the following table |
defaultParallelism | N/A | N/A | 8*{number-of-cores} | Default number of partitions for all RDDs generated by SpookyStuff queries. Parallel operations that depend on remote access are highly skewed, so it is generally good practice to set the parallelism a few times larger than for common Spark tasks |
defaultStorageLevel | N/A | N/A | MEMORY_ONLY | Default storage level of “temporarily” persisted RDDs during query execution. These persisted RDDs take large chunks of memory to facilitate complex query optimization but are usually evicted immediately after their intended usage. If your Spark cluster frequently encounters memory overflow issues, try setting this to MEMORY_ONLY_SER or DISK_ONLY; please refer to the [Spark Storage Level] article for details |
checkpointInterval | N/A | N/A | 100 | Like [Spark Streaming] and [MLlib], [exploring link graph] in SpookyStuff is an iterative process that relies on periodic RDD checkpointing to recover from failures and avoid very long dependency chains. If this is set to a positive integer, RDDs with a dependency chain longer than this will be checkpointed to the directory defined by dirs.checkpoint |
dirs.checkpoint | spooky.dirs.checkpoint | N/A | {dirs.root}/checkpoint | URI of the directory for checkpointing, the URI format should be like scheme://authority/path (read this article for details) |
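For instance, a memory-constrained cluster running long explore queries might trade some speed for stability (a sketch, continuing the spooky context from above; field names follow the conf.{option-name} convention):

```scala
import org.apache.spark.storage.StorageLevel

spooky.conf.defaultStorageLevel = StorageLevel.MEMORY_ONLY_SER  // serialize persisted RDDs to save memory
spooky.conf.checkpointInterval = 50                             // checkpoint twice as often as the default
```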
Query Optimizer Options
Query Optimizer | Meaning |
---|---|
Narrow | Try to minimize shuffling and the number of stages by avoiding “wide” transformations (grouping, cogrouping): duplicated resources in different partitions are fetched as-is and efficient execution relies mostly on web caching. Only recommended if you know that the chance of duplicates across multiple threads is low, or if shuffling costs are high due to heavy iterations (e.g. exploring pagination through links) |
Wide | Aggressively merge duplicate actions before each execution and fan the fetched results back out into multiple rows afterwards: this is often the most efficient tradeoff in terms of total cost. The only notable caveat is exploring deeply through hyperlinks: merging actions across multiple partitions takes at least one wide transformation, which may become expensive in iterative graph exploration. The tradeoff between the Wide and Narrow optimizers becomes tricky when both duplication and heavy iterations co-exist; we will gradually improve our query optimizer to be more adaptive. |
Wide_RDDWebCache | Includes all optimization measures of the Wide optimizer, plus an indexed RDD assigned to each query as an in-memory web cache. Reading/writing this cache is faster than the file-system-based web cache but also eats a lot of memory (similar to an L1 cache as opposed to an FS-based L2 cache). RDD web caches are NOT shared between queries and are usually scrapped after the query finishes execution |
- Wide_RDDWebCache is an experimental feature and should be used tentatively: RDDs are designed to be immutable, so using this optimizer to collect a big dataset may easily result in memory overflow. In addition, some in-memory file systems, e.g. Tachyon and Apache Ignite, may already achieve comparable speed, which renders the RDD web cache non-competitive.
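Switching the optimizer is a single assignment; note that the import path of the optimizer objects below is an assumption and may differ between versions:

```scala
import com.tribbloids.spookystuff.dsl._  // assumed location of Narrow / Wide / Wide_RDDWebCache

// e.g. prefer Narrow for deep pagination crawls with few duplicate links:
spooky.conf.defaultQueryOptimizer = Narrow
```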
Failover
Name | System Property | Environment Variable | Default | Meaning |
---|---|---|---|---|
remoteResourceTimeout | N/A | N/A | 60.seconds | Max waiting duration of fetching a remote resource (including loading dynamic content in browsers) before the connection is retried |
DFSTimeout | N/A | N/A | 40.seconds | Max waiting duration of a file system I/O (e.g. reading/updating the web cache) before the connection is retried |
failOnDFSError | N/A | N/A | false | Whether to fail fast if a file system I/O operation has failed many times, even when it could be circumvented (e.g. web cache access; in the worst case SpookyStuff can smoothly fall back to fetching remotely). Set this option to true if you are using an unstable HDFS/S3 |
errorDump | N/A | N/A | true | Whether to perform a session buffer dump on web client exception |
errorScreenshot | N/A | N/A | true | Whether to take a screenshot of the browser viewport on exception; effective only for browsers that support screenshots (in this version PhantomJS is the only driver supporting this feature) |
dirs.errorDump | spooky.dirs.errordump | N/A | {dirs.root}/cache | URI of the directory for error dump |
dirs.errorScreenshot | spooky.dirs.errorscreenshot | N/A | {dirs.root}/cache | URI of the directory for error screenshot |
dirs.errorDumpLocal | spooky.dirs.errordump.local | N/A | {dirs.root}/cache | if dirs.errorDump is not accessible, use this directory as a backup |
dirs.errorScreenshotLocal | spooky.dirs.errorscreenshot.local | N/A | {dirs.root}/cache | if dirs.errorScreenshot is not accessible, use this directory as a backup |
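For example, an unstable target site or file system can be given longer timeouts (a sketch; duration literals use scala.concurrent.duration, matching the defaults above):

```scala
import scala.concurrent.duration._

spooky.conf.remoteResourceTimeout = 120.seconds  // slow remote sites / heavy dynamic pages
spooky.conf.DFSTimeout = 60.seconds              // slow HDFS/S3 round trips
```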
Web Client
Name | System Property | Environment Variable | Default | Meaning |
---|---|---|---|---|
driverFactory | N/A | N/A | PhantomJS | The headless browser driver used for deep web access (see Optional Components); PhantomJS by default |
proxy | N/A | N/A | NoProxy | Proxy used by web clients; both HTTP and SOCKS5 proxies (e.g. a local Tor client) are supported |
userAgent | N/A | N/A | null | User-Agent string sent by web clients; null uses the client/browser default |
headers | N/A | N/A | Map() | Additional HTTP headers sent with each request |
oAuthKeys | N/A | N/A | null | OAuth keys used to sign outgoing requests, if the target API requires them |
browserResolution | N/A | N/A | 1920x1080 | Viewport resolution of the headless browser; affects page rendering and screenshots |
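For example (all values below are placeholders, and the field types are assumed to be a plain String and Map as the defaults above suggest):

```scala
spooky.conf.userAgent = "Mozilla/5.0 (compatible; MyCrawler/1.0)"  // placeholder User-Agent
spooky.conf.headers = Map("Accept-Language" -> "en-US,en;q=0.8")   // extra request headers
```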
Miscellaneous
Name | System Property | Environment Variable | Default | Meaning |
---|---|---|---|---|
dirs.root | spooky.dirs.root | N/A | {java working directory}/temp | Root directory under which the other default directories (cache, checkpoint, error dump etc.) are resolved |
autoSave | N/A | N/A | true | Whether fetched documents are automatically saved to dirs.autoSave for later inspection |
dirs.autoSave | spooky.dirs.autosave | N/A | {dirs.root}/cache | URI of the directory for auto-saved documents |
shareMetrics | N/A | N/A | false | Whether queries of the same SpookyContext share one set of metrics instead of keeping separate per-query metrics |
defaultJoinType | N/A | N/A | LeftOuter | Default join type of join-like operations |
maxJoinOrdinal | N/A | N/A | 2^31-1 | Maximum ordinal (index) of elements selected by a join; elements beyond this are discarded |
maxExploreDepth | N/A | N/A | 2^31-1 | Maximum depth of explore (link graph traversal) queries |
Scaling
SpookyStuff is optimized for running in Spark cluster mode, which accelerates execution by parallelizing over multiple machines’ processing power and network bandwidth. In most cases this is highly recommended in production, and it is the only feasible way to run queries and extraction on big datasets. However, it is important to understand the following facts and to ensure that your query’s execution complies with your web service providers’ terms; failing to understand the consequences may result in your API key being banned or yourself being prosecuted.
- Despite being able to scale up to hundreds of nodes, SpookyStuff can only approach linear speed-up (speed proportional to parallelism) if there is no other bottleneck; namely, your concurrent access should be handled smoothly by the web services being queried (e.g. brokered by a CDN or load balancer) and by your cluster’s network topology. Otherwise, blindly increasing the size of your cluster will only yield diminishing returns.
- Your API credential will be shared by multiple IP addresses of your cluster for all API calls. This may max out the web server’s connection pool and put heavy load on its infrastructure; make sure this is not frowned upon by your API provider!
- The web cache and checkpointing directories, as well as the other directories under SpookyConf.dirs, must be on a distributed file system (HDFS, S3 etc.); otherwise neither SpookyStuff nor you can access them reliably.
In addition, we recommend the following Spark properties for better performance (a combined example follows this list):
- spark.task.maxFailures=100 (or any sufficiently high number): external web services are less stable than in-house services, so make sure SpookyStuff can retry many times, from multiple machines, to overcome service downtime and connection errors. Note that retries are partition-wise, so make sure your web cache is enabled to avoid repeated remote access.
- spark.serializer=org.apache.spark.serializer.KryoSerializer: shuffling and broadcasting over multiple machines are much more expensive, so enable the more efficient Kryo serializer if you haven’t done so.
- spark.kryoserializer.buffer.max=512m (or a size large enough to handle your largest partition): the default value of 64m may be unable to handle large files; increase it if you run into a KryoException.
- spark.kryo.registrator=com.tribbloids.spookystuff.SpookyRegistrator: not strictly necessary, but can help reduce serialization size.
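Put together, these recommendations can be applied to the SparkConf used to build your SparkContext, for example:

```scala
import org.apache.spark.SparkConf

// Values are the ones suggested above; tune kryoserializer.buffer.max to your largest partition.
val sparkConf = new SparkConf()
  .set("spark.task.maxFailures", "100")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.max", "512m")
  .set("spark.kryo.registrator", "com.tribbloids.spookystuff.SpookyRegistrator")
```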
More Information
Like other Spark applications, SpookyStuff can benefit from the many other options Spark offers; please refer to the Spark Configuration and Spark Tuning guides for more information.
1-Line installation with Ansible
The following section proposes a fully automated install routine using Ansible; it has only been tested on Debian-like Linux distributions. Please ignore this section if your Spark environment does not run Debian, Ubuntu or their variants.