chrome/browser/safe_browsing/prefix_set.h - Issue 6286072: PrefixSet as an alternate to BloomFilter for safe-browsing.

Side by Side Diff: chrome/browser/safe_browsing/prefix_set.h

Issue 6286072: PrefixSet as an alternate to BloomFilter for safe-browsing. (Closed) Base URL: svn://svn.chromium.org/chrome/trunk/src

Patch Set: Example of how a specific set would be stored. Created 9 years, 10 months ago


OLD	NEW
(Empty)
	1 // Copyright (c) 2011 The Chromium Authors. All rights reserved.

	2 // Use of this source code is governed by a BSD-style license that can be

	3 // found in the LICENSE file.

	4 //

	5 // A read-only set implementation for |SBPrefix| items. Prefixes are

	6 // sorted and stored as 16-bit deltas from the previous prefix. An

	7 // index structure provides quick random access, and also handles

	8 // cases where 16 bits cannot encode a delta.

	9 //

	10 // For example, the sequence {20, 25, 41, 65432, 150000, 160000} would

	11 // be stored as:

	12 // A pair {20, 0} in |index_|.

	13 // 5, 16, 65391 in |deltas_|.

	14 // A pair {150000, 3} in |index_|.

	15 // 10000 in |deltas_|.

	16 // |index_.size()| will be 2, |deltas_.size()| will be 4.

	17 //

	18 // This structure is intended for storage of sparse uniform sets of

	19 // prefixes of a certain size. As of this writing, my safe-browsing

	20 // database contains:

	21 // 653132 add prefixes

	22 // 6446 are duplicates (from different chunks)

	23 // 24301 w/in 2^8 of the prior prefix

	24 // 622337 w/in 2^16 of the prior prefix

	25 // 47 further than 2^16 from the prior prefix

	26 // For this input, the memory usage is approximately 2 bytes per

	27 // prefix, a bit over 1.2M. The bloom filter used 25 bits per prefix,

	28 // a bit over 1.9M on this data.

	29 //

	30 // Experimenting with random selections of the above data, storage

	31 // size drops almost linearly as prefix count drops, until the index

	32 // overhead starts to become a problem a bit under 200k prefixes. The

	33 // memory footprint gets worse than storing the raw prefix data around

	34 // 75k prefixes. Fortunately, the actual memory footprint also falls.

	35 // If the prefix count increases the memory footprint should increase

	36 // approximately linearly. The worst case would be 2^16 items all

	37 // 2^16 apart, which would need 512k (versus 256k to store the raw

	38 // data).

	39 //

	40 // TODO(shess): Write serialization code. Something like this should

	41 // work:

	42 // 4 byte magic number

	43 // 4 byte version number

	44 // 4 byte |index_.size()|

	45 // 4 byte |deltas_.size()|

	46 // n * 8 byte |&index_[0]..&index_[n]|

	47 // m * 2 byte |&deltas_[0]..&deltas_[m]|

	48 // 16 byte digest

	49

	50 #ifndef CHROME_BROWSER_SAFE_BROWSING_PREFIX_SET_H_

	51 #define CHROME_BROWSER_SAFE_BROWSING_PREFIX_SET_H_

	52 #pragma once

	53

	54 #include <vector>

	55

	56 #include "chrome/browser/safe_browsing/safe_browsing_util.h"

	57

	58 namespace safe_browsing {

	59

	60 class PrefixSet {

	61 public:

	62 explicit PrefixSet(const std::vector<SBPrefix>& prefixes);

	63

	64 // Returns |true| if |prefix| was in the |prefixes| passed to the constructor.

	65 bool Exists(SBPrefix prefix) const;

	66

	67 private:

	68 // One past the maximum delta that can be encoded in a 16-bit unsigned (2^16).

	69 static const unsigned kMaxDelta = 256 * 256;

	70

	71 // Maximum number of consecutive deltas to encode before generating

	72 // a new index entry. This helps keep the worst-case performance

	73 // for |Exists()| under control.

	74 static const size_t kMaxRun = 100;

	75

	76 // Top-level index of prefix to offset in |deltas_|. Each pair

	77 // indicates a base prefix and where the deltas from that prefix

	78 // begin in |deltas_|. The deltas for a pair end at the next pair's

	79 // index into |deltas_|.

	80 std::vector<std::pair<SBPrefix, size_t> > index_;

	81

	82 // Deltas which are added to the prefix in |index_| to generate

	83 // prefixes. Deltas are only valid between consecutive items from

	84 // |index_|, or the end of |deltas_| for the last |index_| pair.

	85 std::vector<uint16> deltas_;

	86

	87 DISALLOW_COPY_AND_ASSIGN(PrefixSet);

	88 };

	89

	90 } // namespace safe_browsing

	91

	92 #endif // CHROME_BROWSER_SAFE_BROWSING_PREFIX_SET_H_
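
The encoding described in the header comment can be sketched roughly as follows. This is a minimal illustration, not the code in prefix_set.cc: `Prefix`, `PrefixSetSketch`, `Build`, and `Exists` are hypothetical stand-ins, the input is assumed to be sorted, duplicates are assumed to be skipped, and the binary search over the index is only one plausible lookup strategy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical stand-in for SBPrefix; assumed to be a 32-bit unsigned value.
typedef uint32_t Prefix;

// Hypothetical mirror of the two members declared in the header.
struct PrefixSetSketch {
  std::vector<std::pair<Prefix, size_t> > index;  // {base prefix, offset}
  std::vector<uint16_t> deltas;                   // 16-bit gaps between prefixes
};

// Builds the structure from prefixes that are assumed to be sorted ascending.
// A new index entry starts when a gap does not fit in 16 bits, or when
// |max_run| deltas have already been emitted for the current entry.
PrefixSetSketch Build(const std::vector<Prefix>& sorted, size_t max_run) {
  PrefixSetSketch set;
  if (sorted.empty())
    return set;
  Prefix prev = sorted[0];
  size_t run = 0;
  set.index.push_back(std::make_pair(prev, set.deltas.size()));
  for (size_t i = 1; i < sorted.size(); ++i) {
    assert(sorted[i] >= prev);  // Input must be sorted.
    if (sorted[i] == prev)
      continue;  // Assumption: duplicates are dropped.
    const Prefix delta = sorted[i] - prev;
    if (delta > 0xFFFF || run >= max_run) {
      set.index.push_back(std::make_pair(sorted[i], set.deltas.size()));
      run = 0;
    } else {
      set.deltas.push_back(static_cast<uint16_t>(delta));
      ++run;
    }
    prev = sorted[i];
  }
  return set;
}

// Finds the last index entry whose base is <= |prefix|, then walks that
// entry's deltas until the running value reaches or passes |prefix|.
bool Exists(const PrefixSetSketch& set, Prefix prefix) {
  size_t lo = 0;
  size_t hi = set.index.size();
  while (lo < hi) {
    const size_t mid = lo + (hi - lo) / 2;
    if (set.index[mid].first <= prefix)
      lo = mid + 1;
    else
      hi = mid;
  }
  if (lo == 0)
    return false;  // |prefix| is smaller than every base.
  const size_t entry = lo - 1;
  const size_t end = (entry + 1 < set.index.size())
                         ? set.index[entry + 1].second
                         : set.deltas.size();
  Prefix current = set.index[entry].first;
  for (size_t i = set.index[entry].second; i < end && current < prefix; ++i)
    current += set.deltas[i];
  return current == prefix;
}
```

For the example set in the header comment, {20, 25, 41, 65432, 150000, 160000}, `Build(prefixes, 100)` produces two index entries and four deltas, which agrees with the sizes the comment gives.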

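The serialized size implied by the layout in the TODO can be worked out with a few lines of arithmetic. The sketch below takes the field widths from the TODO at face value, including 8 bytes per index entry; `SerializedSizeSketch` is a hypothetical helper, not part of the patch, and a raw dump of `std::pair<SBPrefix, size_t>` could be wider on platforms where `size_t` is 8 bytes.

```cpp
#include <cstddef>

// Hypothetical helper: total bytes for the file layout proposed in the TODO.
size_t SerializedSizeSketch(size_t index_count, size_t delta_count) {
  const size_t kHeaderBytes = 4 + 4 + 4 + 4;  // magic, version, two sizes
  const size_t kIndexEntryBytes = 8;          // per the TODO: n * 8 bytes
  const size_t kDeltaBytes = 2;               // per the TODO: m * 2 bytes
  const size_t kDigestBytes = 16;             // trailing digest
  return kHeaderBytes + index_count * kIndexEntryBytes +
         delta_count * kDeltaBytes + kDigestBytes;
}
```

With the six-prefix example from the header comment (2 index entries, 4 deltas) this comes to 56 bytes, of which 32 are fixed overhead for the header and digest.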