content/browser/renderer_host/duplicate_resource_handler.cc - Issue 10701151: DuplicateContentResourceHandler to monitor resources and track how many times th…

Side by Side Diff: content/browser/renderer_host/duplicate_resource_handler.cc

Issue 10701151: DuplicateContentResourceHandler to monitor resources and track how many times th… (Closed) Base URL: http://src.chromium.org/svn/trunk/src/

Patch Set: Created 8 years, 5 months ago


OLD	NEW
(Empty)
	1 // Copyright (c) 2012 The Chromium Authors. All rights reserved.

	2 // Use of this source code is governed by a BSD-style license that can be

	3 // found in the LICENSE file.

	4

	5 #include "content/browser/renderer_host/duplicate_resource_handler.h"

	6

	7 #include <cmath>

	8 #include <cstring>

	9 #include <set>

	10

	11 #include "base/logging.h"

	12 #include "base/metrics/histogram.h"

	13 #include "content/browser/renderer_host/resource_request_info_impl.h"

	14 #include "net/base/io_buffer.h"

	15 #include "net/url_request/url_request.h"

	16 #include "third_party/smhasher/src/PMurHash.h"

	17

	18

	19 namespace content {

	20

	21 namespace{

	22

	23 // This set keeps track of a hash of resources

	24 // that we have seen

	25 std::set<uint32>* GetSetOfHashes() {
	gavinp 2012/07/18 12:13:32 Naming by data type isn't ideal. Better by use: ContentHashes maybe? And ContentWithUrlHashes below?
	frankwang 2012/07/19 16:10:26 Done.
	26 static std::set<uint32> seen_resources;
	gavinp 2012/07/18 12:13:32 Probably we should bite the bullet, and use base/memory/singleton.h for this.
	frankwang 2012/07/19 16:10:26 Done.
	27 return &seen_resources;

	28 }

	29

	30 // This set keeps track of hash of resources based on origin

	31 // that we have seen previously

	32 std::set<uint32>* GetSetOfHashesWithURL(){

	33 static std::set<uint32> seen_resources_with_url;

	34 return &seen_resources_with_url;

	35 }

	36

	37 } // namespace

	38

	39 DuplicateResourceHandler::DuplicateResourceHandler(

	40 scoped_ptr<ResourceHandler> next_handler,

	41 ResourceType::Type resource_type,

	42 net::URLRequest* request)

	43 : LayeredResourceHandler(next_handler.Pass()),

	44 resource_type_(resource_type),

	45 ph1_(0),

	46 pcarry_(0),

	47 buffer_size_(0),

	48 bytes_read_(0),

	49 request_(request) {

	50 }

	51

	52 DuplicateResourceHandler::~DuplicateResourceHandler() {

	53 }

	54

	55 bool DuplicateResourceHandler::OnWillRead(int request_id, net::IOBuffer** buf,

	56 int* buf_size, int min_size) {

	57 DCHECK_EQ(-1, min_size);

	58

	59 if (!next_handler_->OnWillRead(request_id, buf, buf_size, min_size))

	60 return false;

	61
	gavinp 2012/07/18 12:13:32 Lose this blank line.
	frankwang 2012/07/19 16:10:26 Done.
	62 read_buffer_ = *buf;

	63 buffer_size_ = *buf_size;

	64 return true;

	65 }

	66

	67 bool DuplicateResourceHandler::OnReadCompleted(int request_id, int bytes_read,

	68 bool* defer) {

	69

	70 PMurHash32_Process(&ph1_,&pcarry_,read_buffer_->data(), bytes_read);
	gavinp 2012/07/18 12:13:32 Spaces after commas.
	frankwang 2012/07/19 16:10:26 Done.
	71 bytes_read_ += bytes_read;

	72

	73 return next_handler_->OnReadCompleted(request_id, bytes_read, defer);

	74 }

	75

	76 bool DuplicateResourceHandler::OnResponseCompleted(

	77 int request_id,

	78 const net::URLRequestStatus& status,

	79 const std::string& security_info) {
	gavinp 2012/07/18 12:13:32 What should you do for status != net::URLRequestStatus::SUCCESS ?
	frankwang 2012/07/19 16:10:26 I put in a check so that I pass through when it is not SUCCESS so the next handler can deal with it.
	80

	81 uint32 resource_hash = PMurHash32_Result(ph1_, pcarry_, bytes_read_);

	82

	83 // Hash url into the resource to see whether it is

	84 // from the same or different origin
	gavinp 2012/07/18 12:13:32 Comments should be sentences (have a period) and wrap at 80 columns. Origin isn't the right name for what you're talking about, the origin of a resource is a triple (scheme, hostname, portnumber) usually, as in ('http', 'foo.com', 80) is the origin of http://foo.com/bar/quux?blatto#thisisafragment. You want to compare the url. Of note also is that the urls you get will never have fragments.
	frankwang 2012/07/19 16:10:26 Done.
	85 uint32 hashed_with_url;

	86 const char* url = request_->url().spec().c_str();

	87 int url_length = strlen(url);
	gavinp 2012/07/18 12:13:32 This scares me a bit. Is it safe to trust c_str() after destruction of what may have been a temporary std::string? And why put the null on the string anyway, isn't std::string::data() good enough? I also don't like the O(n) strlen() when we had a C++ string which stored explicit length earlier. Lastly, you include <cstring>, but don't call std::strlen, and instead rely on one of your includes having included string.h. How about: const std::string url_spec = request_->url().spec(); then use url_spec.data() and url_spec.size() in the call to PMurHash32_Process?
	frankwang 2012/07/19 16:10:26 Done.
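The lifetime concern raised in this comment can be sketched in isolation. This is a minimal illustration under stated assumptions, not the patch's code: `GetSpec()` is a hypothetical accessor that returns a `std::string` by value, and `HashBytes()` is a hypothetical FNV-1a helper standing in for `PMurHash32_Process`. The point is only that a named copy keeps the bytes alive and that `data()`/`size()` avoid both the NUL terminator and the O(n) `strlen()`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for an accessor that returns a std::string by value.
std::string GetSpec() {
  return "http://foo.com/bar/quux?blatto";
}

// Hypothetical stand-in for PMurHash32_Process: a plain FNV-1a hash over an
// explicit (pointer, length) pair. No terminating NUL is needed.
std::uint32_t HashBytes(const char* data, std::size_t size) {
  assert(data != nullptr || size == 0);
  std::uint32_t hash = 2166136261u;  // FNV-1a 32-bit offset basis.
  for (std::size_t i = 0; i < size; ++i) {
    hash ^= static_cast<unsigned char>(data[i]);
    hash *= 16777619u;  // FNV-1a 32-bit prime.
  }
  return hash;
}

std::uint32_t HashSpec() {
  // Risky pattern (shown only as a comment): the temporary returned by
  // GetSpec() is destroyed at the end of the full expression, so the pointer
  // would dangle on the next line.
  //   const char* p = GetSpec().c_str();

  // Safer: the named copy outlives every use, and size() is O(1).
  const std::string url_spec = GetSpec();
  return HashBytes(url_spec.data(), url_spec.size());
}
```

Calling `HashSpec()` hashes exactly the bytes of the URL string. Whether the original `c_str()` call actually dangles depends on whether the real accessor returns by value or by reference, which this sketch does not settle.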
	88 PMurHash32_Process(&ph1_, &pcarry_, url, url_length);

	89 hashed_with_url = PMurHash32_Result(ph1_, pcarry_, url_length + bytes_read_);

	90

	91 DVLOG(4) << "url: " << url;

	92 DVLOG(4) << "resource hash: " << resource_hash;

	93 DVLOG(4) << "hash with url: " << hashed_with_url;

	94

	95 // This boolean answers whether we found resource

	96 // based just on hash

	97 const bool did_we_find_resource =
	gavinp 2012/07/18 12:13:32 did_match_contents_ maybe?
	frankwang 2012/07/19 16:10:26 Done.
	98 GetSetOfHashes()->find(resource_hash) !=
	gavinp 2012/07/18 12:13:32 Use count(resource_hash) instead. You don't need to compare to end(), and it makes it more clear you're testing for existence. That should make the comment above moot, too. Same just below. Also, might as well grab a ptr to this set rather than use this method, for easier reading: std::set<uint32>* content_hashes = GetSetOfHashes(); ...
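The `count()` suggestion can be sketched as a standalone helper. This is an illustration only: `SeenBefore()` is a hypothetical name, and the set is passed in rather than fetched through `GetSetOfHashes()`, so the snippet compiles on its own.

```cpp
#include <cassert>
#include <cstdint>
#include <set>

// Returns true if `hash` was already recorded, and records it otherwise.
// `content_hashes` plays the role of the set returned by GetSetOfHashes().
bool SeenBefore(std::set<std::uint32_t>* content_hashes, std::uint32_t hash) {
  assert(content_hashes != nullptr);
  // count() on a std::set is 0 or 1, so it reads as a plain existence test.
  const bool did_match_contents = content_hashes->count(hash) > 0;
  if (!did_match_contents)
    content_hashes->insert(hash);
  return did_match_contents;
}
```

The first call with a given hash returns false and records it; later calls with the same hash return true.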
	99 GetSetOfHashes()->end();

	100

	101 // This boolean checks whether we found a resource from the original url

	102 // as one previously seen

	103 const bool did_we_find_resource_original_url =
	gavinp 2012/07/18 12:13:32 I've thought about it more, and now I don't like this name, it sounds like you are checking if you found an URL match, when really you're checking URL and Data. did_match_contents_and_url_ ?
	frankwang 2012/07/19 16:10:26 Done.
	104 GetSetOfHashesWithURL()->find(hashed_with_url) !=

	105 GetSetOfHashesWithURL()->end();

	106

	107 // If we found the resource, classify whether it is

	108 // from the same url or different
	gavinp 2012/07/18 12:13:32 Best practice is to have a single instance of each histogram. It saves memory, and I think makes more readable code. Consider: UMA_HISTOGRAM_BOOLEAN("Duplicate.Hits", did_match_contents); UMA_HISTOGRAM_BOOLEAN("Duplicate.HitsSameUrl", did_match_contents && did_match_contents_and_url); if (did_match_contents && !did_match_contents_and_url) { UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.Size.HashHitUrlMiss", ...) ....
	frankwang 2012/07/19 16:10:26 Done.
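The "one call site per histogram" structure suggested here can be sketched without the UMA macros. Everything below is an assumption made for illustration: `FakeHistograms` is a hypothetical in-memory recorder, not part of base/metrics, and `RecordDuplicateStats()` is a hypothetical helper name. Only the two boolean histograms are shown; the size and resource-type samples are left out.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the histogram backend: counts how often each
// (histogram name, boolean sample) pair is recorded.
struct FakeHistograms {
  std::map<std::pair<std::string, bool>, int> samples;

  void RecordBoolean(const std::string& name, bool value) {
    ++samples[std::make_pair(name, value)];
  }

  int Count(const std::string& name, bool value) const {
    auto it = samples.find(std::make_pair(name, value));
    return it == samples.end() ? 0 : it->second;
  }
};

// Each histogram name appears at exactly one call site; the sample value is
// computed from the two booleans instead of branching around several copies.
void RecordDuplicateStats(FakeHistograms* histograms,
                          bool did_match_contents,
                          bool did_match_contents_and_url) {
  assert(histograms != nullptr);
  histograms->RecordBoolean("Duplicate.Hits", did_match_contents);
  histograms->RecordBoolean("Duplicate.HitsSameUrl",
                            did_match_contents && did_match_contents_and_url);
}
```

Compared with the nested if/else in the patch, every request still records one sample in each histogram, but each name is written once.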
	109 if (did_we_find_resource) {

	110 // If it is from the original url, it will hit on both caches

	111 if (did_we_find_resource_original_url) {

	112 UMA_HISTOGRAM_BOOLEAN("Duplicate.Hash.Hits", true);
	gavinp 2012/07/18 12:13:32 I've thought about this a bit more: I think you just want Duplicate.Hits and Duplicate.HitsSameUrl . Sorts better (the two varieties will sort right next to each other), and shorter.
	frankwang 2012/07/19 16:10:26 Done.
	113 UMA_HISTOGRAM_BOOLEAN("Duplicate.HashSameUrl.Hits", true);

	114 } else {

	115 // If it is a different url (interesting case), it hits on the

	116 // proposed cache not the current cache

	117 UMA_HISTOGRAM_BOOLEAN("Duplicate.Hash.Hits", true);

	118 UMA_HISTOGRAM_BOOLEAN("Duplicate.HashSameUrl.Hits", false);

	119 // Record bytes missed because we are caching

	120 // based on origin instead of resource

	121 UMA_HISTOGRAM_CUSTOM_COUNTS("Duplicate.HashMiss.Size", bytes_read_,
	gavinp 2012/07/18 12:13:32 I think you want: "Duplicate.Size.HashHitUrlMiss" this will sort better in the histogram tool (put all varieties of sizes near each other), and better describes what we have. Do you also want .HashHit and .HashMiss ? .HashMissUrlHit ?
	frankwang 2012/07/19 16:10:26 Done.
	122 1, 0x7FFFFFFF, 50);

	123 // Record resource type for missed resource

	124 UMA_HISTOGRAM_ENUMERATION("Duplicate.ResourceType", resource_type_,
	gavinp 2012/07/18 12:13:32 The name should reflect that it's for missed resources. "Duplicate.ResourceType.HashMiss". Do you also want to talk about the MIME type?
	frankwang 2012/07/19 16:10:26 Done.
	125 ResourceType::LAST_TYPE);

	126 GetSetOfHashesWithURL()->insert(hashed_with_url);

	127 }

	128 } else {

	129 // We did not see the resource so it is a miss on both caches
	gavinp 2012/07/18 12:13:32 I don't like comments like this. The logic should be concise enough to make this clear.
	frankwang 2012/07/19 16:10:26 Done.
	130 UMA_HISTOGRAM_BOOLEAN("Duplicate.Hash.Hits", false);

	131 UMA_HISTOGRAM_BOOLEAN("Duplicate.HashSameUrl.Hits", false);

	132 GetSetOfHashes()->insert(resource_hash);

	133 GetSetOfHashesWithURL()->insert(hashed_with_url);

	134 }

	135

	136 bytes_read_ = 0;

	137 read_buffer_ = NULL;

	138 return next_handler_->OnResponseCompleted(request_id, status, security_info);

	139 }

	140

	141 } //namespace content
