Results 1 - 4 of 4
Prioritizing Attention in Fast Data: Principles and Promise
"... ABSTRACT While data volumes continue to rise, the capacity of human attention remains limited. As a result, users need analytics engines that can assist in prioritizing attention in this fast data that is too large for manual inspection. We present a set of design principles for the design of fast ..."
Abstract
While data volumes continue to rise, the capacity of human attention remains limited. As a result, users need analytics engines that can assist in prioritizing attention in fast data that is too large for manual inspection. We present a set of design principles for fast data analytics engines that leverage the relative scarcity of human attention and overabundance of data: return fewer results, prioritize iterative analysis, and filter fast to compute less. We report on our early experiences employing these principles in the design and deployment of MacroBase, an open source analysis engine for prioritizing attention in fast data. By combining streaming operators for feature transformation, classification, and data summarization, MacroBase provides users with interpretable explanations of key behaviors, acting as a search engine for fast data.
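The pipeline shape the abstract describes lends itself to a small illustration. The following is a hedged, minimal Python sketch, not MacroBase's actual API: a stream is feature-transformed, points are classified as outliers against a running window, and a summarizer reports the tags most associated with the flagged points, i.e. a few interpretable results instead of the raw stream. All names, thresholds, and the sample data are invented for illustration.

```python
# Illustrative sketch (not the MacroBase API): a minimal streaming pipeline in the
# shape the abstract describes: feature transformation, then classification, then
# summarization, surfacing a few explanations instead of raw records.
from collections import Counter
from statistics import mean, stdev


def transform(records):
    """Feature transformation: extract a numeric feature from each raw record."""
    for r in records:
        yield {"latency_ms": float(r["latency_ms"]), "tags": r["tags"]}


def classify(points, threshold=3.0):
    """Classification: flag points far from the mean of a sliding window."""
    window = []
    for p in points:
        window.append(p["latency_ms"])
        if len(window) > 30:
            window.pop(0)
        if len(window) >= 10:
            mu, sigma = mean(window), stdev(window)
            p["outlier"] = sigma > 0 and abs(p["latency_ms"] - mu) > threshold * sigma
        else:
            p["outlier"] = False
        yield p


def summarize(points, top_k=3):
    """Summarization: report the tags most associated with outlying points."""
    counts = Counter()
    for p in points:
        if p["outlier"]:
            counts.update(p["tags"])
    return counts.most_common(top_k)


if __name__ == "__main__":
    # Synthetic stream: every 50th record is a latency spike tagged "canary-build".
    stream = [{"latency_ms": 20 + (900 if i % 50 == 0 else 0),
               "tags": ["canary-build"] if i % 50 == 0 else ["stable-build"]}
              for i in range(500)]
    print(summarize(classify(transform(stream))))
```

Run on the synthetic stream above, the summary attributes the outliers to the "canary-build" tag, which is the kind of compact explanation the abstract has in mind.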
Ground: A Data Context Service
"... ABSTRACT Ground is an open-source data context service, a system to manage all the information that informs the use of data. Data usage has changed both philosophically and practically in the last decade, creating an opportunity for new data context services to foster further innovation. In this pa ..."
Abstract
Ground is an open-source data context service, a system to manage all the information that informs the use of data. Data usage has changed both philosophically and practically in the last decade, creating an opportunity for new data context services to foster further innovation. In this paper we frame the challenges of managing data context with basic ABCs: Applications, Behavior, and Change. We provide motivation and design guidelines, present our initial design of a common metamodel and API, and explore the current state of the storage solutions that could serve the needs of a data context service. Along the way we highlight opportunities for new research and engineering solutions.

FROM CRISIS TO OPPORTUNITY

Traditional database management systems were developed in an era of risk-averse design. The technology itself was expensive, as was the on-site cost of managing it. Expertise was scarce and concentrated in a handful of computing and consulting firms. Two conservative design patterns emerged that lasted many decades. First, the accepted best practices for deploying databases revolved around tight control of schemas and data ingest in support of general-purpose accounting and compliance use cases. Typical advice from data warehousing leaders held that "There is no point in bringing data . . . into the data warehouse environment without integrating it." As computing and data have become orders of magnitude more efficient, changes have emerged for both of these patterns. Usage is changing profoundly, as expertise and control shifts from the central accountancy of an IT department to the domain expertise of "business units" tasked with extracting value from data.

Crisis: Big Metadata

An unfortunate consequence of the disaggregated nature of contemporary data systems is the lack of a standard mechanism to assemble a collective understanding of the origin, scope, and usage of the data they manage. In the absence of a better solution to this pressing need, the Hive Metastore is sometimes used, but it only serves simple relational schemas, a dead end for representing a Variety of data. As a result, data lake projects typically lack even the most rudimentary information about the data they contain or how it is being used. For emerging Big Data customers and vendors, this Big Metadata problem is hitting a crisis point.

Two significant classes of end-user problems follow directly from the absence of shared metadata services. The first is poor productivity. Analysts are often unable to discover what data exists, much less how it has been previously used by peers. Valuable data is left unused and human effort is routinely duplicated, particularly in a schema-on-use world with raw data that requires preparation. "Tribal knowledge" is a common description for how organizations manage this productivity problem. This is clearly not a systematic solution, and it scales very poorly as organizations grow. The second problem stemming from the absence of a system to track metadata is governance risk. Data management necessarily entails tracking or controlling who accesses data, what they do with it, where they put it, and how it gets consumed downstream. In the absence of a standard place to store metadata and answer these questions, it is impossible to enforce policies and/or audit behavior. As a result, many administrators marginalize their Big Data stack as a playpen for non-critical data, and thereby inhibit both the adoption and the potential of new technologies.
In our experiences deploying and managing systems in production, we have seen the need for a common service layer to support the capture, publishing and sharing of metadata information in a flexible way. The effort in this paper began by addressing that need.

Opportunity: Data Context

The lack of metadata services in the Big Data stack can be viewed as an opportunity: a clean slate to rethink how we track and leverage modern usage of data. Storage economics and schema-on-use agility suggest that the Data Lake movement could go much farther than Data Warehousing in enabling diverse, widely-used central repositories of data that can adapt to new data formats and rapidly changing organizations. In that spirit, we advocate rethinking traditional metadata in a far more comprehensive sense. More generally, what we should strive to capture is the full context of data. To emphasize the conceptual shifts of this data context, and as a complement to the "three Vs" of Big Data, we introduce three key sources of information: the ABCs of Data Context. Each represents a major change from the simple metadata of traditional enterprise data management.

Applications: Application context is the core information that describes how raw bits get interpreted for use. In modern agile scenarios, application context is often relativistic (many schemas for the same data) and complex (with custom code for data interpretation). Application context ranges from basic data descriptions (encodings, schemas, ontologies, tags), to statistical models and parameters, to user annotations. All of the artifacts involved (wrangling scripts, view definitions, model parameters, training sets, etc.) are critical aspects of application context.

Behavior: This is information about how data was created and used over time. In decoupled systems, behavioral context spans multiple services, applications and formats and often originates from high-volume sources (e.g., machine-generated usage logs). Not only must we track upstream lineage (the data sets and code that led to the creation of a data object), we must also track the downstream lineage, including data products derived from this data object. Aside from data lineage, behavioral context includes logs of usage: the "digital exhaust" left behind by computations on the data. As a result, behavioral context metadata can often be larger than the data itself.

Change: This is information about the version history of data, code and associated information, including changes over time to both structure and content. Traditional metadata focused on the present, but historical context is increasingly useful in agile organizations. This context can be a linear sequence of versions, or it can encompass branching and concurrent evolution, along with interactions between co-evolving versions. By tracking the version history of all objects spanning code, data, and entire analytics pipelines, we can simplify debugging and enable auditing and counterfactual analysis.

Data context services represent an opportunity for database technology innovation, and an urgent requirement for the field. We are building an open-source data context service we call Ground, to serve as a central model, API and repository for capturing the broad context in which data gets used. Our goal is to address practical problems for the Big Data community in the short term and to open up opportunities for long-term research and innovation.
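To make the ABCs above concrete, the following is a hypothetical Python sketch of how the three kinds of context might hang off a single data asset. It is an illustration only, not the Common Ground metamodel presented later in the paper; every type and field name here is invented.

```python
# Hypothetical sketch only (not the Common Ground metamodel) showing how the
# three kinds of context might hang off a single data asset.
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ApplicationContext:
    """How raw bits get interpreted: schemas, encodings, tags, model artifacts."""
    schema: Dict[str, str]
    tags: List[str] = field(default_factory=list)
    artifacts: List[str] = field(default_factory=list)  # e.g. wrangling scripts, view definitions


@dataclass
class BehaviorContext:
    """How the data was created and used: upstream/downstream lineage, usage logs."""
    upstream: List[str] = field(default_factory=list)    # datasets/code this asset was derived from
    downstream: List[str] = field(default_factory=list)  # data products derived from this asset
    usage_log_refs: List[str] = field(default_factory=list)


@dataclass
class ChangeContext:
    """Version history of the asset and its associated code, possibly branching."""
    versions: List[str] = field(default_factory=list)             # ordered version ids
    parents: Dict[str, List[str]] = field(default_factory=dict)   # version id -> parent version ids


@dataclass
class ContextRecord:
    asset_id: str
    application: ApplicationContext
    behavior: BehaviorContext
    change: ChangeContext
```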
In the remainder of the paper we illustrate the opportunities in this space, design requirements for solutions, and our initial efforts to tackle these challenges in open source.

DIVERSE USE CASES

To illustrate the potential of the Ground data context service, we describe two concrete scenarios in which Ground can aid in data discovery, facilitate better collaboration, protect confidentiality, help diagnose problems, and ultimately enable new value to be captured from existing data. After presenting these scenarios, we explore the design requirements for a data context service.

Scenario: Context-Enabled Analytics

This scenario represents the kind of usage we see in relatively technical organizations making aggressive use of data for machine-learning-driven applications like customer targeting. In these organizations, data analysts make extensive use of flexible tools for data preparation and visualization and often have some SQL skills, while data scientists actively prototype and develop custom software for machine learning applications.

Janet is an analyst in the Customer Satisfaction department at a large bank. She suspects that the social network behavior of customers can predict if they are likely to close their accounts (customer churn). Janet has access to a rich context-service-enabled data lake and a wide range of tools that she can use to assess her hypothesis. Janet begins by downloading a free sample of a social media feed. She uses an advanced data catalog application (we'll call it "Catly") which connects to Ground, recognizes the content of her sample, and notifies her that the bank's data lake has a complete feed from the previous month. She then begins using Catly to search the lake for data on customer retention: what is available, and who has access to it? As Janet explores candidate schemas and data samples, Catly retrieves usage data from Ground and notifies her that Sue, from the data-science team, had previously used a database table called cust_roster as input to a Python library called cust_churn. Examining a sample from cust_roster and knowing of Sue's domain expertise, Janet decides to work with that table in her own churn analysis.

Having collected the necessary data, Janet turns to a data preparation application ("Preply") to clean and transform the data. The social media data is a JSON document; Preply searches Ground for relevant wrangling scripts and suggests unnesting attributes and pivoting them into tables. Based on security information in Ground, Preply warns Janet that certain customer attributes in her table are protected and may not be used for customer retention analysis. Finally, to join the social media names against the customer names, Preply uses previous wrangling scripts registered with Ground by other analysts to extract standardized keys and suggest join conditions to Janet. Having prepared the data, Janet loads it into her BI charting tool and discovers a strong correlation between customer churn and social sentiment. Janet uses the "share" feature of the BI tool to send it to Sue; the tool records the share in Ground.

Sue has been working on a machine learning pipeline for automated discount targeting. Janet's chart has useful features, so Sue consults Ground to find the input data. Sue joins Janet's dataset into her existing training data but discovers that her pipeline's prediction accuracy decreases.
Examining Ground's schema for Janet's dataset, Sue realizes that the sentiment column is categorical and needs to be pivoted into indicator columns isPositive, isNegative, and isNeutral. Sue writes a Python script to transform Janet's data into a new file in the required format. She trains a new version of the targeting model and deploys it to send discount offers to customers at risk of leaving. Sue registers her training pipeline, including Janet's social media feeds, in the daily build; Ground is informed of the new code versions and service registration. After several weeks of improved predictions, Sue receives an alert from Ground about changes in Janet's script; she also sees a notable drop in the prediction accuracy of her pipeline. Sue discovers that some of the new social media messages are missing sentiment scores. She queries Ground for the version of the data and pipeline code when sentiment scores first went missing. Upon examination, she sees that the upgrade to the sentiment analysis code produced new categories for which she doesn't have columns (e.g., isAngry, isSad, . . . ). Sue uses Ground to roll back the sentiment analysis code in Janet's pipeline and re-run her pipeline for the past month. This fixes Sue's problem, but Sue wonders if she can simply roll back Janet's scripts in production. Consulting Ground, Sue discovers that other pipelines now depend upon the new version of Janet's scripts. Sue calls a meeting with the relevant stakeholders to untangle the situation.

Throughout our scenario, the users and their applications benefited from global data context. Applications like Catly and Preply were able to provide innovative features by mining the "tribal knowledge" captured in Ground: recommending datasets and code, identifying experts, flagging security concerns, notifying developers of changes, etc. The users were provided contextual awareness of both technical and organizational issues and were able to interrogate global context to understand root causes. Many of these features exist in isolated applications today, but would work far better with global context. Data context services make this possible, opening up opportunities for innovation, efficiency and better governance.

Scenario: Big Data in Enterprise IT

Many organizations are not as technical as the one in our previous scenario. We received feedback on an early draft of this paper from an IT executive at a global financial services firm (not affiliated with the authors), who characterized both Janet and Sue as "developers," not analysts. ("If she knows what JSON is, she's a developer!") In his organization, such developers represent less than 10% of the data users. The remaining 90% interact solely with graphical interfaces. However, he sees data context offering enormous benefits to his organization. Here we present an illustrative enterprise IT scenario.

Mark is a Data Governance manager working in the IT department of a global bank. He is responsible for a central data warehouse and the legacy systems that support it, including Extract-Transform-Load (ETL) mappings for loading operational databases into the warehouse, and Master Data Management (MDM) systems for governing the "golden master" of various reference data sets (customers, partner organizations, and so on). Recently, the bank decided to migrate off of these systems and onto a Big Data stack, to accommodate larger data volumes and a greater variety of data.
In so doing, they rewrote many of their workflows; the new workflows register their context in Ground. Sara is an analyst in the bank's European Compliance office; she uses Preply to prepare monthly reports for various national governments demonstrating the firm's compliance with regulations like Basel III [33]. As Sara runs this month's AssetAllocation report, she sees that a field called IPRE_AUSNZ came back with a very small value relative to other fields prefixed with IPRE. She submits a request to the IT department's trouble ticket system ("Helply") referencing the report she ran, asking "What is this field? What are the standard values? If it is unusual, can you help me understand why?" Mark receives the ticket in his email, and Helply stores an association in Ground between Sara and AssetAllocation.

Mark looks in Ground at summary statistics for the report fields over time, and confirms that the value in that field is historically low by an order of magnitude. Mark then looks at a "data dictionary" of reference data in Ground and sees that IPRE was documented as "Income-Producing Real Estate". He looks at lineage data in Ground and finds that the IPRE_AUSNZ field in the report is calculated by a SQL view aggregating data from both Australia and New Zealand. He also looks at version information for the view behind AssetAllocation, and finds that the view was modified on the second day of the month to compute two new fields, IPRE_AUS and IPRE_NZ, that separate the reporting across those geographies. Mark submits a response in Helply that explains this to Sara. Armed with that information, Sara uses the Preply UI to sum all three fields into a single cell representing the IPRE calculation for the pair of countries over the course of the full month.

Based on the Helply association, Sara is subscribed automatically to an RSS feed associated with AssetAllocation. In the future, Sara will automatically learn about changes that affect the report, thanks to the new workflows from Mark's team that auto-generate data lineage in Ground. Mark's team takes responsibility for upstream reporting of version changes to data sources (e.g., reference data) and code (ETL scripts, warehouse queries, etc.), as well as the data lineage implicit in that code. Using that data lineage, a script written by Mark's team auto-computes downstream Helply alerts for all data products that depend transitively on a change to upstream data and scripts.

In this scenario, both the IT and business users benefit from various kinds of context stored in Ground, including statistical data profiles, data dictionaries, field-level data lineage, code version history, and (transitive) associations between people, data, code and their versions. Our previous data science use cases largely exploited statistical and probabilistic aspects of context (correlations, recommendations); in this scenario, the initial motivation was quantitative, but the context was largely used in more deterministic and discrete ways (dependencies, definitions, alerts). Over time, we believe organizations will leverage data context using both deterministic and probabilistic approaches.

DESIGN AND ARCHITECTURE

In a decoupled architecture of multiple applications and backend services, context serves as a "narrow waist": a single point of access for the basic information about data and its usage. It is hard to anticipate the breadth of applications that could emerge.
Hence, in designing Ground, we were keen to focus on initial decisions that could enable new services and applications in the future.

Design Requirements

In our design, we were guided by Postel's Law of Robustness from Internet architecture: "Be conservative in what you do, be liberal in what you accept from others." Guided by this philosophy, we identified four central design requirements for a successful data context service.

Model-Agnostic. For a data context service to be broadly adopted, it cannot impose opinions on metadata modeling. Data models evolve and persist over time: modern organizations have to manage everything from COBOL data layouts to RDBMS dumps to XML, JSON, Apache logs and free text. As a result, the context service cannot prescribe how metadata is modeled; each dataset may have different metadata to manage. This is a challenge in legacy "master data" systems, and a weakness in the Big Data stack today: Hive Metastore captures fixed features of relational schemas; HDFS captures fixed features of files. A key challenge in Ground is to design a core metamodel that captures generic information that applies to all data, as well as custom information for different data models, applications, and usage. We explore this issue in Section 3.3.

Immutable. Data context must be immutable; updating stored context is tantamount to erasing history. There are multiple reasons why history is critical. The latest context may not always be the most relevant: we may want to replay scenarios from the past for what-if analysis or debugging, or we may want to study how context information (e.g., the success rate of a statistical model) changes over time. Prior context may also be important for governance and veracity purposes: we may be asked to audit historical behavior and metadata, or reproduce experimental results published in the past. This simplifies record-keeping, but of course it raises significant engineering challenges. We explore this issue in Section 4.

Scalable. It is a frequent misconception that metadata is small. In fact, metadata scaling was already a challenge in previous-generation ETL technology. In many Big Data settings, it is reasonable to envision the data context being far larger than the data itself. Usage information is one culprit: logs from a service can often outstrip the data managed by the service. Another is data lineage, which can grow to be extremely large depending on the kind of lineage desired.

[Figure: the Common Ground metamodel (Section 3) is at the center, supported by a set of swappable underground services. The system is intended to support a growing set of aboveground applications, examples of which are shown. Ground is decoupled from applications and services via asynchronous messaging services. Our initial concrete instantiation of this architecture, Ground 0, is described in Section 4.]

Politically Neutral. A common narrow-waist service like data context must interoperate with a wide range of other services and systems designed and marketed by often competing vendors. Customers will only adopt and support a central data context service if they feel no fear of lock-in; application writers will prioritize support for widely-used APIs to maximize the benefit of their efforts. It is important to note here that open source is not equivalent to political neutrality; customers and developers have to believe that the project leadership has strong incentives to behave in the common interest.
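As a loose illustration of the Immutable requirement above, the Python sketch below shows one way an append-only, versioned context store could behave: writes never overwrite, every change appends a new version pointing at its parent, and history can be replayed for audits or what-if analysis. This is a sketch under those assumed semantics, not Ground's storage design; the class and method names are invented.

```python
# Sketch of the "immutable" requirement under assumed semantics (not Ground's
# storage layer): context is never updated in place; every change appends a new
# version that points at its parent, so history stays replayable.
import time
import uuid


class AppendOnlyContextStore:
    def __init__(self):
        self._versions = {}   # version_id -> immutable record
        self._heads = {}      # asset_id -> latest version_id

    def put(self, asset_id, metadata):
        """Append a new immutable version; the previous head becomes its parent."""
        version_id = str(uuid.uuid4())
        self._versions[version_id] = {
            "asset_id": asset_id,
            "metadata": dict(metadata),       # copy so callers cannot mutate stored state
            "parent": self._heads.get(asset_id),
            "timestamp": time.time(),
        }
        self._heads[asset_id] = version_id
        return version_id

    def history(self, asset_id):
        """Walk the version chain from newest to oldest, e.g. for audits or replay."""
        v = self._heads.get(asset_id)
        while v is not None:
            record = self._versions[v]
            yield v, record
            v = record["parent"]


if __name__ == "__main__":
    store = AppendOnlyContextStore()
    store.put("cust_roster", {"owner": "sue", "schema_version": 1})
    store.put("cust_roster", {"owner": "sue", "schema_version": 2})
    for vid, rec in store.history("cust_roster"):
        print(vid, rec["metadata"])
```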
Based on the requirements above, the Ground architecture is informed by Postel's Law of Robustness and the design pattern of decoupled components. At its heart is a foundational metamodel called Common Ground with an associated aboveground API for data management applications like the catalog and wrangling examples above. The core functions underneath Ground are provided by swappable component services that plug in via the underground API. A sketch of the architecture of Ground is provided in the figure above.

Key Services

Ground's functionality is backed by five decoupled subservices, connected via direct REST APIs and a message bus. For agility, we are starting the project using existing open source solutions for each service. We anticipate that some of these will require additional features for our purposes. In this section we discuss the role of each subservice, and highlight some of the research opportunities we foresee. Our initial choices for subservices are described in Section 4.

Ingest: Insertion, Crawlers and Queues. Metadata may be pushed into Ground or require crawling; it may arrive interactively via REST APIs or in batches via a message bus. A main design decision is to decouple the systems plumbing of ingest from an extensible set of metadata and feature extractors. To this end, ingest has both underground and aboveground APIs. New context metadata arrives for ingestion into Ground via an underground queue API from crawling services, or via an aboveground REST API from applications. As metadata arrives, Ground publishes notifications via an aboveground queue. Aboveground applications can subscribe to these events to add unique value, fetching the associated metadata and data, and generating enhanced metadata asynchronously. For example, an application can subscribe for file crawl events, hand off the files to an entity extraction system like OpenCalais or AlchemyAPI, and subsequently tag the corresponding Common Ground metadata objects with the extracted entities. Metadata feature extraction is an active research area; we hope that commodity APIs for scalable data crawling and ingest will drive more adoption and innovation in this area.

Versioned Metadata Storage. Ground must be able to efficiently store and retrieve metadata with the full richness of the Common Ground metamodel, including flexible version management of code and data, general-purpose model graphs and lineage storage. While none of the existing open source DBMSs target this data model, one can implement it in a shim layer above many of them. We discuss this at greater length in Section 4.1, where we examine a range of widely-used open source DBMSs. As noted in that section, we believe this is an area for significant database research.

Search and Query. Access to context information in Ground is expected to be complex and varied. As is noted later, Common Ground supports arbitrary tags, which leads to a requirement for search-style indexing that, in current open source, is best served by an indexing service outside the storage system. Second, intelligent applications like those in Section 2 will run significant analytical workloads over metadata, especially usage metadata, which could be quite large. Third, the underlying graphs in the Common Ground model require support for basic graph queries like transitive closures. Finally, it seems natural that some workloads will need to combine these three classes of queries.
As we explore in Section 4.1, various open-source solutions can address these workloads at some level, but there is significant opportunity for research here.

Authentication and Authorization. Identity management and authorization are required for a context service, and must accommodate typical packages like LDAP and Kerberos. Note that authorization needs vary widely: the policies of a scientific consortium will differ from those of a defense agency or a marketing department. Ground's flexible metamodel can support a variety of relevant metadata (ownership, content labels, etc.). Meanwhile, the role of versioning raises subtle security questions. Suppose the authorization policies of a past time are considered unsafe today: should reproducibility and debugging be disallowed? More research is needed to integrate versions and lineage with security techniques like Information Flow Control.
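The graph queries mentioned under Search and Query above are easy to illustrate. Below is a minimal, hedged Python sketch of a transitive closure over downstream lineage edges, the kind of query behind the enterprise scenario's "alert every data product that depends transitively on an upstream change." The lineage edges and names are hypothetical stand-ins for what a context service would store.

```python
# Minimal sketch of a transitive closure over downstream lineage edges, e.g. to
# find every data product that depends, directly or indirectly, on a changed
# upstream object. The edge data here is invented for illustration.
from collections import deque


def downstream_closure(edges, changed):
    """Return all nodes reachable from `changed` via downstream lineage edges."""
    affected, frontier = set(), deque([changed])
    while frontier:
        node = frontier.popleft()
        for child in edges.get(node, []):
            if child not in affected:
                affected.add(child)
                frontier.append(child)
    return affected


if __name__ == "__main__":
    # Hypothetical lineage: node -> list of downstream dependents.
    lineage = {
        "reference_data.customers": ["etl.load_customers"],
        "etl.load_customers": ["warehouse.cust_roster"],
        "warehouse.cust_roster": ["reports.AssetAllocation", "ml.churn_training_set"],
    }
    print(downstream_closure(lineage, "reference_data.customers"))
```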
C3: Internet-Scale Control Plane for Video Quality Optimization
"... As Internet video goes mainstream, we see increas-ing user expectations for higher video quality and new global policy requirements for content providers. In-spired by the case for centralizing network-layer control, we present C3, a control system for optimizing Internet video delivery. The design ..."
Abstract
As Internet video goes mainstream, we see increasing user expectations for higher video quality and new global policy requirements for content providers. Inspired by the case for centralizing network-layer control, we present C3, a control system for optimizing Internet video delivery. The design of C3 addresses key challenges in ensuring scalability and tackling data plane heterogeneity. First, to ensure scalability and responsiveness, C3 introduces a novel split control plane architecture that can tolerate a small increase in model staleness for a dramatic increase in scalability. Second, C3 supports diverse client-side platforms via a minimal client-side sensing/actuation layer and offloads complex monitoring and control logic to the control plane. C3 has been operational for eight years, and today handles more than 100M sessions per day from 244 countries for 100+ content providers and has improved the video quality significantly. In doing so, C3 serves as a proof point of the viability of fine-grained centralized control for Internet-scale applications. Our experiences reinforce the case for centralizing control with the continued emergence of new use case pulls (e.g., client diversity) and technology pushes (e.g., big data platforms).
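To make the "trade a little model staleness for scalability" idea concrete, here is a toy Python sketch of that general pattern: a thin client layer senses quality and actuates against a periodically refreshed snapshot of a globally computed decision model, instead of contacting the controller on every event. This is an illustration of the pattern only, with invented names and policies; it is not C3's actual design, interfaces, or decision logic.

```python
# Toy sketch only (not C3's design or API): one way to trade a little model
# staleness for scalability is to have clients decide against a periodically
# refreshed snapshot of a global model instead of calling the controller per event.
import time


class ControlPlane:
    """Recomputes a global CDN-selection model from aggregated session reports."""
    def __init__(self):
        self._model = {"default": "cdn-a"}

    def rebuild(self, reports):
        # Pick, per client group, whichever CDN had the lower average buffering.
        by_group = {}
        for group, cdn, buffering in reports:
            by_group.setdefault(group, {}).setdefault(cdn, []).append(buffering)
        self._model = {g: min(cdns, key=lambda c: sum(cdns[c]) / len(cdns[c]))
                       for g, cdns in by_group.items()}

    def snapshot(self):
        return dict(self._model), time.time()


class ClientSensingLayer:
    """Thin client layer: senses quality, actuates using a possibly stale snapshot."""
    def __init__(self, group, max_staleness_s=60.0):
        self.group, self.max_staleness_s = group, max_staleness_s
        self._model, self._fetched_at = {}, 0.0

    def choose_cdn(self, control_plane):
        # Refresh only when the local snapshot is older than the staleness budget.
        if time.time() - self._fetched_at > self.max_staleness_s:
            self._model, self._fetched_at = control_plane.snapshot()
        return self._model.get(self.group, self._model.get("default", "cdn-a"))


if __name__ == "__main__":
    plane = ControlPlane()
    plane.rebuild([("eu-dsl", "cdn-a", 4.0), ("eu-dsl", "cdn-b", 1.5)])
    print(ClientSensingLayer("eu-dsl").choose_cdn(plane))  # -> "cdn-b"
```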
NEXT: A System for Real-World Development, Evaluation, and Application of Active Learning
"... Active learning methods automatically adapt data collection by selecting the most informative samples in order to accelerate machine learning. Because of this, real-world testing and comparing active learning algorithms requires collecting new datasets (adaptively), rather than simply applying algor ..."
Abstract
Active learning methods automatically adapt data collection by selecting the most informative samples in order to accelerate machine learning. Because of this, testing and comparing active learning algorithms in the real world requires collecting new datasets (adaptively), rather than simply applying algorithms to benchmark datasets, as is the norm in (passive) machine learning research. To facilitate the development, testing and deployment of active learning for real applications, we have built an open-source software system for large-scale active learning research and experimentation. The system, called NEXT, provides a unique platform for real-world, reproducible active learning research. This paper details the challenges of building the system and demonstrates its capabilities with several experiments. The results show how experimentation can help expose strengths and weaknesses of active learning algorithms, in sometimes unexpected and enlightening ways.
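As a hedged illustration of what "selecting the most informative samples" can mean in practice, the sketch below implements plain uncertainty sampling with scikit-learn: fit a model on a small labeled seed set, then ask an annotator about the unlabeled points the model is least confident on. It shows one common active learning strategy in general, not the NEXT system or its algorithms; the data and names are synthetic.

```python
# Minimal illustration (not the NEXT system): uncertainty sampling, one common way
# active learning picks "the most informative samples": query the points the
# current model is least sure about instead of labeling data at random.
import numpy as np
from sklearn.linear_model import LogisticRegression


def next_queries(model, X_unlabeled, batch_size=10):
    """Return indices of the unlabeled points the model is least confident about."""
    proba = model.predict_proba(X_unlabeled)
    uncertainty = 1.0 - proba.max(axis=1)   # low top-class probability = high uncertainty
    return np.argsort(-uncertainty)[:batch_size]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)          # synthetic labels for the demo

    labeled = list(rng.choice(len(X), size=20, replace=False))
    model = LogisticRegression().fit(X[labeled], y[labeled])

    pool = np.array([i for i in range(len(X)) if i not in set(labeled)])
    query = pool[next_queries(model, X[pool], batch_size=5)]
    print("ask an annotator to label:", query)
```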