  • 1
    Online Resource
    Association for Computing Machinery (ACM); 2009
    In: Communications of the ACM, Association for Computing Machinery (ACM), Vol. 52, No. 10 (2009-10), p. 97-105
    Abstract: Many data generation processes can be modeled as data streams. They produce huge numbers of pieces of data, each of which is simple in isolation, but which taken together lead to a complex whole. For example, the sequence of queries posed to an Internet search engine can be thought of as a stream, as can the collection of transactions across all branches of a supermarket chain. In aggregate, this data can arrive at enormous rates, easily in the realm of hundreds of gigabytes per day or higher. While this data may be archived and indexed within a data warehouse, it is also important to process the data "as it happens," to provide up-to-the-minute analysis and statistics on current trends. Methods to achieve this must be quick to respond to each new piece of information, and must use resources which are very small when compared to the total quantity of data.

    These applications and others like them have led to the formulation of the so-called "streaming model." In this abstraction, algorithms take only a single pass over their input and must accurately compute various functions while using resources (space and time per item) that are strictly sublinear in the size of the input, ideally polynomial in the logarithm of the input size. The output must be produced at the end of the stream, or when queried on the prefix of the stream observed so far. (Other variations ask for the output to be maintained continuously in the presence of updates, or over a "sliding window" of only the most recent updates.) Some problems are simple in this model: for example, given a stream of transactions, finding the mean and standard deviation of the bill totals can be accomplished by retaining a few "sufficient statistics" (the sum of all values, the sum of squared values, etc.). Others can be shown to require a large amount of information to be stored, such as determining whether a particular search query has already appeared anywhere within a large stream of queries. Determining which problems can be solved effectively within this model remains an active research area.

    The frequent items problem (also known as the heavy hitters problem) is one of the most heavily studied questions in data streams. The problem is popular due to its simplicity to state and its intuitive interest and value, and it is important both in itself and as a subroutine within more advanced data stream computations. Informally, given a sequence of items, the problem is simply to find those items which occur most frequently. Typically, this is formalized as finding all items whose frequency exceeds a specified fraction of the total number of items; this is shown in Figure 1 of the article. Variations arise when the items are given weights, and further when these weights can also be negative.

    This abstract problem captures a wide variety of settings. The items can represent packets on the Internet, with the weights being the sizes of the packets; the frequent items then represent the most popular destinations or the heaviest bandwidth users (depending on how the items are extracted from the flow identifiers). This knowledge can help in optimizing routing decisions, in in-network caching, and in planning where to add new capacity. Or the items can represent queries made to an Internet search engine, and the frequent items are then the (currently) popular terms. These are not simply hypothetical examples, but genuine cases where algorithms for this problem have been applied by large corporations: AT&T and Google, respectively.

    Given the size of the data (which is being generated at high speed), it is important to find algorithms which are capable of processing each new update very quickly, without blocking. It also helps if the working space of the algorithm is very small, so that the analysis can happen over many different groups in parallel, and because small structures are likely to have better cache behavior and hence further increase throughput. Obtaining efficient and scalable solutions to the frequent items problem is also important because many streaming applications need to find frequent items as a subroutine of another, more complex computation. Most directly, mining frequent itemsets inherently builds on finding frequent items as a basic building block. Finding the entropy of a stream requires learning the most frequent items in order to compute their contribution to the entropy directly, and to remove that contribution before approximating the entropy of the residual stream. The HSS (Hierarchical Sampling from Sketches) technique uses hashing to derive multiple substreams, the frequent elements of which are extracted to estimate the frequency moments of the stream. The frequent items problem is also related to the recently popular area of compressed sensing.

    Other work solves generalized versions of the frequent items problem by building on algorithms for the "vanilla" version. Several techniques for finding the frequent items in a "sliding window" of recent updates (instead of all updates) operate by keeping track of the frequent items in many sub-windows. In the "heavy hitters distinct" problem, with applications to detecting network scanning attacks, the count of an item is the number of distinct pairs containing that item paired with a secondary item; it is typically solved by extending a frequent items algorithm with distinct counting algorithms. Frequent items algorithms have also been applied to models of probabilistic streaming data, and within faster "skipping" techniques.

    Thus the problem is an important one to understand and study in order to produce efficient streaming implementations. It remains an active area, with many new research contributions produced every year on the core problem and its variations. Because of the amount of work on this problem, it is easy to overlook important references or to fail to appreciate the properties of certain algorithms. There are several cases where algorithms first published in the 1980s have been "rediscovered" two decades later; existing work is sometimes claimed to be incapable of a certain guarantee, which in truth it can provide with only minor modifications; and experimental evaluations do not always compare against the most suitable methods.

    In this paper, we present the main ideas in this area by describing some of the most significant algorithms for the core problem of finding frequent items, using common notation and terminology. In doing so, we also present the historical development of these algorithms. Studying these algorithms is instructive, as they are relatively simple but can be shown to provide formal guarantees on the quality of their output as a function of an accuracy parameter ε. We also provide baseline implementations of many of these algorithms, against which future algorithms can be compared and on top of which algorithms for different problems can be built. We perform experimental evaluation of the algorithms over a variety of data sets to indicate their performance in practice. From this, we are able to identify clear distinctions among the algorithms that are not apparent from their theoretical analysis alone.
    Type of Medium: Online Resource
    ISSN: 0001-0782, 1557-7317
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2009
    ZDB ID: 80254-2
    ZDB ID: 2004542-6
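    The abstract above notes that streaming mean and standard deviation need only a few "sufficient statistics." The following is a minimal Python sketch of that idea; the class name and interface are illustrative, not taken from the paper:

      import math

      class RunningStats:
          """O(1)-space streaming mean/stddev from three sufficient
          statistics: count, sum of values, and sum of squared values."""

          def __init__(self):
              self.n = 0           # items seen so far
              self.total = 0.0     # sum of all values
              self.total_sq = 0.0  # sum of squared values

          def update(self, x):
              self.n += 1
              self.total += x
              self.total_sq += x * x

          def mean(self):
              return self.total / self.n

          def stddev(self):
              m = self.mean()
              # population variance: E[x^2] - (E[x])^2
              return math.sqrt(max(self.total_sq / self.n - m * m, 0.0))

    Each update is constant time and the state is three numbers regardless of stream length, which is exactly the kind of strictly sublinear resource usage the streaming model demands.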
  • 2
    Online Resource
    Association for Computing Machinery (ACM); 2008
    In: Proceedings of the VLDB Endowment, Association for Computing Machinery (ACM), Vol. 1, No. 1 (2008-08), p. 201-212
    Abstract: We study selectivity estimation techniques for set similarity queries. A wide variety of similarity measures for sets have been proposed in the past. In this work we concentrate on the class of weighted similarity measures (e.g., TF/IDF and BM25 cosine similarity and variants) and design selectivity estimators based on a priori constructed samples. First, we study the pitfalls associated with straightforward applications of random sampling, and argue that care needs to be taken in how the samples are constructed; uniform random sampling yields very low accuracy, while query-sensitive real-time sampling is more expensive than exact solutions (both in CPU and I/O cost). We show how to build robust samples a priori, based on existing synopses for distinct value estimation. We prove the accuracy of our technique theoretically, and verify its performance experimentally. Our algorithm is orders of magnitude faster than exact solutions and has very small space overhead.
    Type of Medium: Online Resource
    ISSN: 2150-8097
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2008
    ZDB ID: 2478691-3
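    The paper above targets weighted set-similarity measures such as TF/IDF cosine. As background on the class of measures whose selectivity is being estimated (this is not the paper's sampling-based estimator), a hedged sketch of IDF-weighted cosine similarity between token sets:

      import math
      from collections import Counter

      def idf_weights(corpus):
          """Per-token IDF over a corpus of token sets: log(N / df(t))."""
          n = len(corpus)
          df = Counter(tok for s in corpus for tok in s)
          return {tok: math.log(n / d) for tok, d in df.items()}

      def weighted_cosine(a, b, w):
          """Cosine similarity of two token sets under token weights w."""
          norm_a = math.sqrt(sum(w.get(t, 0.0) ** 2 for t in a))
          norm_b = math.sqrt(sum(w.get(t, 0.0) ** 2 for t in b))
          if norm_a == 0.0 or norm_b == 0.0:
              return 0.0
          dot = sum(w.get(t, 0.0) ** 2 for t in a & b)
          return dot / (norm_a * norm_b)

    A selectivity estimator answers "how many sets in the collection have weighted_cosine(q, s) above a threshold" without scanning them all; the paper builds such estimators from a priori constructed samples.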
  • 3
    Online Resource
    Association for Computing Machinery (ACM); 2008
    In: Proceedings of the VLDB Endowment, Association for Computing Machinery (ACM), Vol. 1, No. 2 (2008-08), p. 1530-1541
    Abstract: The frequent items problem is to process a stream of items and find all items occurring more than a given fraction of the time. It is one of the most heavily studied problems in data stream mining, dating back to the 1980s. Many applications rely directly or indirectly on finding the frequent items, and implementations are in use in large scale industrial systems. However, there has not been much comparison of the different methods under uniform experimental conditions. It is common to find papers touching on this topic in which important related work is mischaracterized, overlooked, or reinvented. In this paper, we aim to present the most important algorithms for this problem in a common framework. We have created baseline implementations of the algorithms, and used these to perform a thorough experimental study of their properties. We give empirical evidence that there is considerable variation in the performance of frequent items algorithms. The best methods can be implemented to find frequent items with high accuracy using only tens of kilobytes of memory, at rates of millions of items per second on cheap modern hardware.
    Type of Medium: Online Resource
    ISSN: 2150-8097
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2008
    ZDB ID: 2478691-3
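    One of the classic counter-based algorithms that experimental studies like this one compare is Misra-Gries, dating to the early 1980s. A minimal sketch, assuming unweighted items: any item occurring more than n/k times in a stream of length n is guaranteed to survive in the summary.

      def misra_gries(stream, k):
          """Maintain at most k-1 counters; every surviving count
          underestimates the item's true frequency by at most n/k."""
          counters = {}
          for item in stream:
              if item in counters:
                  counters[item] += 1
              elif len(counters) < k - 1:
                  counters[item] = 1
              else:
                  # decrement all counters; evict those that reach zero
                  for key in list(counters):
                      counters[key] -= 1
                      if counters[key] == 0:
                          del counters[key]
          return counters

    The output is a superset of the truly frequent items; when a second pass is possible, exact counts for these candidates can be computed to discard false positives.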
  • 4
    Online Resource
    Now Publishers; 2009
    In: Foundations and Trends in Databases, Now Publishers, Vol. 2, No. 4 (2009), p. 267-402
    Type of Medium: Online Resource
    ISSN: 1931-7883, 1931-7891
    Language: English
    Publisher: Now Publishers
    Publication Date: 2009
  • 5
    Online Resource
    Association for Computing Machinery (ACM); 2009
    In: ACM Transactions on Database Systems, Association for Computing Machinery (ACM), Vol. 34, No. 1 (2009-04), p. 1-35
    Abstract: In the emerging area of sensor-based systems, a significant challenge is to develop scalable, fault-tolerant methods to extract useful information from the data the sensors collect. An approach to this data management problem is the use of sensor database systems, which allow users to perform aggregation queries such as MIN, COUNT, and AVG on the readings of a sensor network. In addition, more advanced queries such as frequency counting and quantile estimation can be supported. Due to energy limitations in sensor-based networks, centralized data collection is generally impractical, so most systems use in-network aggregation to reduce network traffic. However, even these aggregation strategies remain bandwidth-intensive when combined with the fault-tolerant, multipath routing methods often used in these environments. To avoid this expense, we investigate the use of approximate in-network aggregation using small sketches. We present duplicate-insensitive sketching techniques that can be implemented efficiently on small sensor devices with limited hardware support and we analyze both their performance and accuracy. Finally, we present an experimental evaluation that validates the effectiveness of our methods.
    Type of Medium: Online Resource
    ISSN: 0362-5915, 1557-4644
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2009
    ZDB ID: 196155-X
    ZDB ID: 2006335-0
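    The duplicate-insensitive sketches discussed in the abstract above are in the spirit of Flajolet-Martin distinct counting. The following is a simplified illustration of that style (single hash function, no averaging, so the estimate is rough), not the authors' exact construction:

      import hashlib

      class FMSketch:
          """Flajolet-Martin style distinct-count sketch. Adding the
          same item twice leaves the state unchanged, so duplicates
          introduced by multipath routing cannot inflate the count."""

          def __init__(self):
              self.bitmap = 0

          def add(self, item):
              h = int.from_bytes(
                  hashlib.sha1(str(item).encode()).digest()[:8], "big")
              # record the position of the lowest-order 1 bit of the hash
              self.bitmap |= h & -h

          def merge(self, other):
              # union of two sketches is a bitwise OR: order- and
              # duplicate-insensitive, ideal for in-network aggregation
              self.bitmap |= other.bitmap

          def estimate(self):
              # r = position of the lowest zero bit in the bitmap;
              # the distinct count is roughly 2^r / 0.77351
              r = 0
              while self.bitmap & (1 << r):
                  r += 1
              return 2 ** r / 0.77351

    In a sensor network, each node can merge its children's sketches with its own and forward the result, so COUNT DISTINCT style aggregates survive duplicated and reordered deliveries.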
  • 6
    Online Resource
    Association for Computing Machinery (ACM); 2007
    In: ACM Transactions on Database Systems, Association for Computing Machinery (ACM), Vol. 32, No. 2 (2007-06), p. 8-
    Abstract: A common problem in many types of databases is retrieving the most similar matches to a query object. Finding these matches in a large database can be too slow to be practical, especially in domains where objects are compared using computationally expensive similarity (or distance) measures. Embedding methods can significantly speed up retrieval by mapping objects into a vector space, where distances can be measured rapidly using a Minkowski metric. In this article we present a novel way to improve embedding quality. In particular, we propose to construct embeddings that use a query-sensitive distance measure for the target space of the embedding. This distance measure is used to compare those vectors that the query and database objects are mapped to. The term "query-sensitive" means that the distance measure changes depending on the current query object. We demonstrate theoretically that using a query-sensitive distance measure increases the modeling power of embeddings and allows them to capture more of the structure of the original space. We also demonstrate experimentally that query-sensitive embeddings can significantly improve retrieval performance. In experiments with an image database of handwritten digits and a time-series database, the proposed method outperforms existing state-of-the-art non-Euclidean indexing methods, meaning that it provides significantly better tradeoffs between efficiency and retrieval accuracy.
    Type of Medium: Online Resource
    ISSN: 0362-5915, 1557-4644
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2007
    ZDB ID: 196155-X
    ZDB ID: 2006335-0
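    Here "query-sensitive" means the distance used in the embedded vector space changes with the query. A minimal sketch of that flavor, using a per-query weighted L1 (Minkowski) distance; the weighting heuristic below is hypothetical and stands in for the article's construction:

      import numpy as np

      def query_sensitive_l1(q_vec, db_vecs, weights):
          """Weighted L1 distance from the embedded query to each
          embedded database object; weights are chosen per query,
          which is what makes the measure query-sensitive."""
          return np.abs(db_vecs - q_vec) @ weights

      def pick_weights(q_vec, db_vecs):
          """Hypothetical per-query heuristic: trust coordinates where
          this query sits far from the collection's mean."""
          spread = np.abs(q_vec - db_vecs.mean(axis=0))
          return spread / (spread.sum() + 1e-12)

      # usage: rank database objects for one query q over matrix X
      # w = pick_weights(q, X)
      # order = np.argsort(query_sensitive_l1(q, X, w))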
  • 7
    Online Resource
    Institute of Electrical and Electronics Engineers (IEEE); 2007
    In: IEEE Transactions on Knowledge and Data Engineering, Institute of Electrical and Electronics Engineers (IEEE), Vol. 19, No. 1 (2007-01), p. 96-110
    Type of Medium: Online Resource
    ISSN: 1041-4347
    Language: Unknown
    Publisher: Institute of Electrical and Electronics Engineers (IEEE)
    Publication Date: 2007
    ZDB ID: 1001468-8
    ZDB ID: 2026620-0
    SSG: 24,1
  • 8
    Online Resource
    Springer Science and Business Media LLC; 2006
    In: The VLDB Journal, Springer Science and Business Media LLC, Vol. 15, No. 1 (2006-01), p. 1-20
    Type of Medium: Online Resource
    ISSN: 1066-8888, 0949-877X
    Language: English
    Publisher: Springer Science and Business Media LLC
    Publication Date: 2006
    ZDB ID: 1463009-6
  • 9
    Online Resource
    Association for Computing Machinery (ACM); 2011
    In: Proceedings of the VLDB Endowment, Association for Computing Machinery (ACM), Vol. 4, No. 12 (2011-08), p. 1395-1398
    Abstract: With the availability of very large databases, an exploratory query can easily lead to a vast answer set, typically ranked by an answer's relevance (e.g., top-k, tf-idf) to the user query. Navigating through such an answer set requires huge effort, and users give up after perusing the first few answers; interesting answers hidden further down the answer set can thus easily be missed. One approach to this problem is to present the user with the most diverse among the answers, based on some diversity criterion. In this demonstration we present DivDB, a system we built to provide query result diversification for both advanced and novice users. For experienced users, who may want to test the performance of existing and new algorithms, we provide an SQL-based extension for formulating queries with diversification. Novice users, who may be more interested in the result than in how to tune the various algorithms' parameters, can instead give the optimizer a "hint" trading speed against quality of result. Moreover, novice users can use an interface to dynamically change the tradeoff value between relevance and diversity in the result, and thus visually inspect the result as they interact with this parameter. This is valuable to the end user because finding a good tradeoff value is a very hard task that depends on several variables (e.g., query parameters, evaluation algorithms, and dataset properties). In this demonstration we present a study of the DivDB system with two image databases that contain many images of the same object under different settings (e.g., different camera angles). We show how DivDB helps users iteratively inspect diversification in the query result, without needing to know how to tune the many parameters of the several algorithms in the DivDB system.
    Type of Medium: Online Resource
    ISSN: 2150-8097
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2011
    ZDB ID: 2478691-3
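    DivDB's relevance-versus-diversity knob is in the spirit of greedy max-marginal-relevance selection. A hedged sketch with a tradeoff parameter lam, where relevance and similarity are caller-supplied functions (DivDB's actual algorithms and SQL extension are not reproduced here):

      def mmr_select(candidates, relevance, similarity, k, lam=0.7):
          """Greedily pick k items, each maximizing
          lam * relevance(c) - (1 - lam) * max similarity to the
          already-selected set. lam=1.0 ranks purely by relevance;
          lam=0.0 diversifies as aggressively as possible."""
          selected = []
          remaining = list(candidates)
          while remaining and len(selected) < k:
              def score(c):
                  sim = max((similarity(c, s) for s in selected),
                            default=0.0)
                  return lam * relevance(c) - (1 - lam) * sim
              best = max(remaining, key=score)
              selected.append(best)
              remaining.remove(best)
          return selected

    Sliding lam and re-running the selection is one way to let a user visually inspect the result as they interact with the tradeoff parameter, as the demonstration describes.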
  • 10
    Online Resource
    Association for Computing Machinery (ACM); 2009
    In: Proceedings of the VLDB Endowment, Association for Computing Machinery (ACM), Vol. 2, No. 2 (2009-08), p. 1660-1661
    Abstract: This tutorial provides a comprehensive overview of recent research progress on the important problem of approximate search in string collections. We identify existing indexes, search algorithms, filtering strategies, selectivity-estimation techniques and other work, and comment on their respective merits and limitations.
    Type of Medium: Online Resource
    ISSN: 2150-8097
    Language: English
    Publisher: Association for Computing Machinery (ACM)
    Publication Date: 2009
    ZDB ID: 2478691-3