  • 1
    UID: edocfu_9959234349002883
    Extent: 1 online resource (526 pages): illustrations
    Edition: 1st edition
    ISBN: 1-78588-828-5
    Content: Master the techniques and sophisticated analytics used to construct Spark-based solutions that scale to deliver production-grade data science products.
    About This Book: Develop and apply advanced analytical techniques with Spark. Learn how to tell a compelling story with data science using Spark's ecosystem. Explore data at scale and work with cutting-edge data science methods.
    Who This Book Is For: This book is for those who have beginner-level familiarity with the Spark architecture and data science applications, especially those who are looking for a challenge and want to learn cutting-edge techniques. It assumes working knowledge of data science, common machine learning methods, and popular data science tools, and assumes you have previously run proof-of-concept studies and built prototypes.
    What You Will Learn:
    - Learn the design patterns that integrate Spark into industrialized data science pipelines
    - See how commercial data scientists design scalable and reusable code for data science services
    - Explore cutting-edge data science methods so that you can study trends and causality
    - Discover advanced programming techniques using RDD and the DataFrame and Dataset APIs
    - Find out how Spark can be used as a universal ingestion engine and as a web scraper
    - Practice the implementation of advanced topics in graph processing, such as community detection and contact chaining
    - Get to know the best practices of Extended Exploratory Data Analysis, commonly used in commercial data science teams
    - Study advanced Spark concepts, solution design patterns, and integration architectures
    - Demonstrate powerful data science pipelines
    In Detail: Data science seeks to transform the world using data, and this is typically achieved through disrupting and changing real processes in real industries. To operate at this level you need to build data science solutions of substance: solutions that solve real problems. Spark has emerged as the big data platform of choice for data scientists due to its speed, scalability, and easy-to-use APIs. This book deep-dives into using Spark to deliver production-grade data science solutions, demonstrated by exploring the construction of a sophisticated global news analysis service that uses Spark to generate continuous geopolitical and current affairs insights. You will learn all about the core Spark APIs and take a comprehensive tour of advanced libraries, including Spark SQL, Spark Streaming, MLlib...
    Note: Includes index.
    Contents:
    Cover -- Copyright -- Credits -- Foreword -- About the Authors -- About the Reviewer -- www.PacktPub.com -- Customer Feedback -- Table of Contents -- Preface
    Chapter 1: The Big Data Science Ecosystem -- Introducing the Big Data ecosystem -- Data management -- Data management responsibilities -- The right tool for the job -- Overall architecture -- Data Ingestion -- Data Lake -- Reliable storage -- Scalable data processing capability -- Data science platform -- Data Access -- Data technologies -- The role of Apache Spark -- Companion tools -- Apache HDFS -- Advantages -- Disadvantages -- Installation -- Amazon S3 -- Advantages -- Disadvantages -- Installation -- Apache Kafka -- Advantages -- Disadvantages -- Installation -- Apache Parquet -- Advantages -- Disadvantages -- Installation -- Apache Avro -- Advantages -- Disadvantages -- Installation -- Apache NiFi -- Advantages -- Disadvantages -- Installation -- Apache YARN -- Advantages -- Disadvantages -- Installation -- Apache Lucene -- Advantages -- Disadvantages -- Installation -- Kibana -- Advantages -- Disadvantages -- Installation -- Elasticsearch -- Advantages -- Disadvantages -- Installation -- Accumulo -- Advantages -- Disadvantages -- Installation -- Summary
    Chapter 2: Data Acquisition -- Data pipelines -- Universal ingestion framework -- Introducing the GDELT news stream -- Discovering GDELT in real-time -- Our first GDELT feed -- Improving with publish and subscribe -- Content registry -- Choices and more choices -- Going with the flow -- Metadata model -- Kibana dashboard -- Quality assurance -- Example 1 - Basic quality checking, no contending users -- Example 2 - Advanced quality checking, no contending users -- Example 3 - Basic quality checking, 50% utility due to contending users -- Summary
    Chapter 3: Input Formats and Schema -- A structured life is a good life -- GDELT dimensional modeling -- GDELT model -- First look at the data -- Core global knowledge graph model -- Hidden complexity -- Denormalized models -- Challenges with flattened data -- Issue 1: Loss of contextual information -- Issue 2: Re-establishing dimensions -- Issue 3: Including reference data -- Loading your data -- Schema agility -- Reality check -- GKG ELT -- Position matters -- Avro -- Spark-Avro method -- Pedagogical method -- When to perform Avro transformation -- Parquet -- Summary
    Chapter 4: Exploratory Data Analysis -- The problem, principles and planning -- Understanding the EDA problem -- Design principles -- General plan of exploration -- Preparation -- Introducing mask based data profiling -- Introducing character class masks -- Building a mask based profiler -- Setting up Apache Zeppelin -- Constructing a reusable notebook -- Exploring GDELT -- GDELT GKG datasets -- The files -- Special collections -- Reference data -- Exploring the GKG v2.1 -- The Translingual files -- A configurable GCAM time series EDA -- Plot.ly charting on Apache Zeppelin -- Exploring translation sourced GCAM sentiment with plot.ly -- Concluding remarks -- A configurable GCAM Spatio-Temporal EDA -- Introducing GeoGCAM -- Does our spatial pivot work? -- Summary
    Chapter 5: Spark for Geographic Analysis -- GDELT and oil -- GDELT events -- GDELT GKG -- Formulating a plan of action -- GeoMesa -- Installing -- GDELT Ingest -- GeoMesa Ingest -- MapReduce to Spark -- Geohash -- GeoServer -- Map layers -- CQL -- Gauging oil prices -- Using the GeoMesa query API -- Data preparation -- Machine learning -- Naive Bayes -- Results -- Analysis -- Summary
    Chapter 6: Scraping Link-Based External Data -- Building a web scale news scanner -- Accessing the web content -- The Goose library -- Integration with Spark -- Scala compatibility -- Serialization issues -- Creating a scalable, production-ready library -- Build once, read many -- Exception handling -- Performance tuning -- Named entity recognition -- Scala libraries -- NLP walkthrough -- Extracting entities -- Abstracting methods -- Building a scalable code -- Build once, read many -- Scalability is also a state of mind -- Performance tuning -- GIS lookup -- GeoNames dataset -- Building an efficient join -- Offline strategy - Bloom filtering -- Online strategy - Hash partitioning -- Content deduplication -- Context learning -- Location scoring -- Names de-duplication -- Functional programming with Scalaz -- Our de-duplication strategy -- Using the mappend operator -- Simple clean -- DoubleMetaphone -- News index dashboard -- Summary
    Chapter 7: Building Communities -- Building a graph of persons -- Contact chaining -- Extracting data from Elasticsearch -- Using the Accumulo database -- Setup Accumulo -- Cell security -- Iterators -- Elasticsearch to Accumulo -- A graph data model in Accumulo -- Hadoop input and output formats -- Reading from Accumulo -- AccumuloGraphxInputFormat and EdgeWritable -- Building a graph -- Community detection algorithm -- Louvain algorithm -- Weighted Community Clustering (WCC) -- Description -- Preprocessing stage -- Initial communities -- Message passing -- Community back propagation -- WCC iteration -- Gathering community statistics -- WCC Computation -- WCC iteration -- GDELT dataset -- The Bowie effect -- Smaller communities -- Using Accumulo cell level security -- Summary
    Chapter 8: Building a Recommendation System -- Different approaches -- Collaborative filtering -- Content-based filtering -- Custom approach -- Uninformed data -- Processing bytes -- Creating a scalable code -- From time to frequency domain -- Fast Fourier transform -- Sampling by time window -- Extracting audio signatures -- Building a song analyzer -- Selling data science is all about selling cupcakes -- Using Cassandra -- Using the Play framework -- Building a recommender -- The PageRank algorithm -- Building a Graph of Frequency Co-occurrence -- Running PageRank -- Building personalized playlists -- Expanding our cupcake factory -- Building a playlist service -- Leveraging the Spark job server -- User interface -- Summary
    Chapter 9: News Dictionary and Real-Time Tagging System -- The mechanical Turk -- Human intelligence tasks -- Bootstrapping a classification model -- Learning from Stack Exchange -- Building text features -- Training a Naive Bayes model -- Laziness, impatience, and hubris -- Designing a Spark Streaming application -- A tale of two architectures -- The CAP theorem -- The Greeks are here to help -- Importance of the Lambda architecture -- Importance of the Kappa architecture -- Consuming data streams -- Creating a GDELT data stream -- Creating a Kafka topic -- Publishing content to a Kafka topic -- Consuming Kafka from Spark Streaming -- Creating a Twitter data stream -- Processing Twitter data -- Extracting URLs and hashtags -- Keeping popular hashtags -- Expanding shortened URLs -- Fetching HTML content -- Using Elasticsearch as a caching layer -- Classifying data -- Training a Naive Bayes model -- Thread safety -- Predict the GDELT data -- Our Twitter mechanical Turk -- Summary
    Chapter 10: Story De-duplication and Mutation -- Detecting near duplicates -- First steps with hashing -- Standing on the shoulders of the Internet giants -- Simhashing -- The hamming weight -- Detecting near duplicates in GDELT -- Indexing the GDELT database -- Persisting our RDDs -- Building a REST API -- Area of improvement -- Building stories -- Building term frequency vectors -- The curse of dimensionality, the data science plague -- Optimizing KMeans -- Story mutation -- The Equilibrium state -- Tracking stories over time -- Building a streaming application -- Streaming KMeans -- Visualization -- Building story connections -- Summary
    Chapter 11: Anomaly Detection on Sentiment Analysis -- Following the US elections on Twitter -- Acquiring data in stream -- Acquiring data in batch -- The search API -- Rate limit -- Analysing sentiment -- Massaging Twitter data -- Using the Stanford NLP -- Building the Pipeline -- Using Timely as a time series database -- Storing data -- Using Grafana to visualize sentiment -- Number of processed tweets -- Give me my Twitter account back -- Identifying the swing states -- Twitter and the Godwin point -- Learning context -- Visualizing our model -- Word2Graph and Godwin point -- Building a Word2Graph -- Random walks -- A Small Step into sarcasm detection -- Building features -- #LoveTrumpsHates -- Scoring Emojis -- Training a KMeans model -- Detecting anomalies -- Summary
    Chapter 12: TrendCalculus -- Studying trends -- The TrendCalculus algorithm -- Trend windows -- Simple trend -- User Defined Aggregate Functions -- Simple trend calculation -- Reversal rule -- Introducing the FHLS bar structure -- Visualize the data -- FHLS with reversals -- Edge cases -- Zero values -- Completing the gaps -- Stackable processing -- Practical applications -- Algorithm characteristics -- Advantages -- Disadvantages -- Possible use cases -- Chart annotation -- Co-trending -- Data reduction -- Indexing -- Fractal dimension -- Streaming proxy for piecewise linear regression -- Summary
    Chapter 13: Secure Data -- Data security -- The problem -- The basics -- Authentication and authorization -- Access control lists (ACL) -- Role-based access control (RBAC) -- Access -- Encryption -- Data at rest.
    Other edition: ISBN 1-78588-214-7
    Language: English