| OLD | NEW |
|---|---|
| (Empty) | |
| 1 #!/usr/bin/python | |
| 2 # Copyright (c) 2011 The Chromium Authors. All rights reserved. | |
| 3 # Use of this source code is governed by a BSD-style license that can be | |
| 4 # found in the LICENSE file. | |
| 5 | |
| 6 """Takes in a dataset profiles file and outputs to a dictionary list format | |
|
dennisjeffrey
2011/02/11 00:53:17
The first line of this comment should be a 1-line
dyu1
2011/02/16 03:17:31
Done.
| |
| 7 for converting Autofill profile datasets. | |
| 8 | |
| 9 Used for test autofill.AutoFillTest.testMergeDuplicateProfilesInAutofill. | |
| 10 """ | |
| 11 | |
| 12 import re | |
| 13 import codecs | |
| 14 import sys | |
| 15 import os | |
|
dennisjeffrey
2011/02/11 00:53:17
These should be specified in alphabetical order.
dyu1
2011/02/16 03:17:31
Done.
| |
| 16 | |
| 17 | |
| 18 class DatasetConverter(object): | |
| 19 def __init__(self, input_filename, output_filename = None, | |
| 20 display_nothing = True, display_input_lines = False, | |
| 21 display_converted_lines = False): | |
|
dennisjeffrey
2011/02/11 00:53:17
Don't put spaces around the "=" when you're defini
dennisjeffrey
2011/02/11 00:53:17
Using the "logging" module with different verbosit
dyu1
2011/02/16 03:17:31
Done.
| |
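The suggestion above to use the `logging` module instead of the three `display_*` flags could look roughly like the following. This is a minimal sketch, not the code that eventually landed: the `verbosity` parameter and both function names are hypothetical, and it is written for Python 3, whereas the script under review targets Python 2.

```python
import logging


def configure_logging(verbosity):
    """Maps a hypothetical verbosity count to a logging level.

    0 shows warnings only (roughly display_nothing=True), 1 adds progress
    messages, and anything higher also shows each input and converted line.
    """
    levels = {0: logging.WARNING, 1: logging.INFO}
    level = levels.get(verbosity, logging.DEBUG)
    logging.getLogger().setLevel(level)
    return level


def report_progress(index, line, converted):
    """Logs one converted record at the level matching its importance."""
    logging.debug('%d: %s', index, line)
    logging.debug('converted to: %s', converted)
    if index % 10 == 0:
        logging.info('%d lines converted so far', index)
```

With this shape the constructor no longer needs the display flags; the caller picks a level once and the conversion code just logs.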
| 22 """Constructs a dataset converter object. | |
| 23 | |
| 24 Full input pattern: | |
| 25 '(?P<NAME_FIRST>.*?)\|(?P<NAME_MIDDLE>.*?)\|(?P<NAME_LAST>.*?)\| | |
| 26 (?P<EMAIL_ADDRESS>.*?)\|(?P<COMPANY_NAME>.*?)\|(?P<ADDRESS_HOME_LINE1>.*?) | |
| 27 \|(?P<ADDRESS_HOME_LINE2>.*?)\|(?P<ADDRESS_HOME_CITY>.*?)\| | |
| 28 (?P<ADDRESS_HOME_STATE>.*?)\|(?P<ADDRESS_HOME_ZIP>.*?)\| | |
| 29 (?P<ADDRESS_HOME_COUNTRY>.*?)\| | |
| 30 (?P<PHONE_HOME_WHOLE_NUMBER>.*?)\|(?P<PHONE_FAX_WHOLE_NUMBER>.*?)$' | |
| 31 | |
| 32 Full output pattern: | |
| 33 "{u'NAME_FIRST': u'%s', u'NAME_MIDDLE': u'%s', u'NAME_LAST': u'%s', | |
| 34 u'EMAIL_ADDRESS': u'%s', u'COMPANY_NAME': u'%s', u'ADDRESS_HOME_LINE1': | |
| 35 u'%s', u'ADDRESS_HOME_LINE2': u'%s', u'ADDRESS_HOME_CITY': u'%s', | |
| 36 u'ADDRESS_HOME_STATE': u'%s', u'ADDRESS_HOME_ZIP': u'%s', | |
| 37 u'ADDRESS_HOME_COUNTRY': u'%s', u'PHONE_HOME_WHOLE_NUMBER': u'%s', | |
| 38 u'PHONE_FAX_WHOLE_NUMBER': u'%s',}," | |
| 39 | |
| 40 The pattern is a regular expression which has named parenthesis groups | |
|
Nirnimesh
2011/02/11 19:39:54
I think the input/output pattern above is illustra
dyu1
2011/02/16 03:17:31
Done.
| |
| 41 like this (?P<name>...) in order to match the '|' separated fields. | |
| 42 If we had only the NAME_FIRST and NAME_MIDDLE fields (e.g. 'Jared|JV') our | |
| 43 pattern would be: "(?P<NAME_FIRST>.*?)\|(?P<NAME_MIDDLE>.*?)$" | |
| 44 | |
| 45 This means that '(?P<NAME_FIRST> regexp)\|' matches whatever regular | |
| 46 expression is inside the parentheses, and indicates the start and end of a | |
| 47 group; the contents of a group can be retrieved after a match has been | |
| 48 performed using the symbolic group name 'NAME_FIRST'. | |
| 49 | |
| 50 The regexp is '.*?'. The '.*' part means to match 0 or more repetitions of any | |
| 51 character. The following '?' makes the regexp non-greedy meaning it will | |
| 52 stop at the first occurrence of the '|' character (escaped in the pattern). | |
| 53 | |
| 54 For '(?P<NAME_MIDDLE>.*?)$' there is no '|' at the end, so we have '$' to | |
| 55 indicate the end of the line. | |
| 56 | |
| 57 The full pattern is constructed once from the FIELDS list. | |
| 58 | |
| 59 The out_line_pattern for one field: "{u'NAME_FIRST': u'%s'," | |
| 60 is ready to accept the value for the 'NAME_FIRST' field once it is extracted | |
| 61 from an input line using the above group pattern. | |
| 62 | |
| 63 'pattern' is used in CreateDictionaryFromRecord(line) to construct and | |
| 64 return a dictionary from a line. | |
| 65 | |
| 66 'out_line_pattern' is used in 'convert()' to construct the final dataset | |
| 67 line that will be printed to the output file. | |
| 68 | |
| 69 Args: | |
| 70 input_filename: name and path of the input dataset. | |
| 71 output_filename: name and path of the converted file, default is None. | |
| 72 display_nothing: output display on the screen, default is True. | |
| 73 display_input_lines: output display of the input file, default is False. | |
| 74 display_converted_lines: output display of the converted file, | |
| 75 default is False. | |
| 76 """ | |
| 77 self._fields = [ | |
| 78 u'NAME_FIRST', | |
| 79 u'NAME_MIDDLE', | |
| 80 u'NAME_LAST', | |
| 81 u'EMAIL_ADDRESS', | |
| 82 u'COMPANY_NAME', | |
| 83 u'ADDRESS_HOME_LINE1', | |
| 84 u'ADDRESS_HOME_LINE2', | |
| 85 u'ADDRESS_HOME_CITY', | |
| 86 u'ADDRESS_HOME_STATE', | |
| 87 u'ADDRESS_HOME_ZIP', | |
| 88 u'ADDRESS_HOME_COUNTRY', | |
| 89 u'PHONE_HOME_WHOLE_NUMBER', | |
| 90 u'PHONE_FAX_WHOLE_NUMBER', | |
| 91 ] | |
|
dennisjeffrey
2011/02/11 00:53:17
Since _fields is just a constant array, would it b
dyu1
2011/02/16 03:17:31
Done.
| |
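The named-group pattern described in the constructor docstring, together with the idea of a constant field list, can be sketched as below. This is an illustration only: `FIELDS` is shortened to three entries, `build_input_pattern` is a hypothetical helper, the sample last name 'Smith' is made up, and the code is written for Python 3.

```python
import re

# Shortened, module-level version of the field list, for illustration only.
FIELDS = ['NAME_FIRST', 'NAME_MIDDLE', 'NAME_LAST']


def build_input_pattern(fields):
    """Builds one non-greedy named group per field, joined by escaped '|'."""
    groups = ['(?P<%s>.*?)' % name for name in fields]
    # '$' anchors the final group at the end of the line, since no '|' follows it.
    return re.compile(r'\|'.join(groups) + '$')


pattern = build_input_pattern(FIELDS)
match = pattern.match('Jared|JV|Smith')
record = match.groupdict()
```

A line with fewer fields than groups does not match at all, so `pattern.match('Jared|JV')` returns `None` here.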
| 92 self._output_pattern = u"{" | |
|
Nirnimesh
2011/02/11 19:39:54
prefer single quote char '
dyu1
2011/02/16 03:17:31
Done.
| |
| 93 for key in self._fields: | |
| 94 self._output_pattern += u"u'%s': u'%s', " %(key, "%s") | |
|
dennisjeffrey
2011/02/11 00:53:17
I think this could be re-written like this:
self.
dyu1
2011/02/16 03:17:31
Done.
| |
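One possible way to build the output pattern without the loop and the trailing slice is a single `join`. This is a sketch and not necessarily the rewrite the reviewer had in mind (that comment is truncated above); `build_output_pattern` is a hypothetical name, the field list is shortened, and the code targets Python 3.

```python
# Shortened field list, for illustration only.
FIELDS = ['NAME_FIRST', 'NAME_MIDDLE']


def build_output_pattern(fields):
    """Returns a format string with one u'%s' placeholder per field."""
    # '%%s' survives the first formatting pass as a literal '%s' placeholder.
    pairs = ["u'%s': u'%%s'" % name for name in fields]
    return '{' + ', '.join(pairs) + '},'


output_pattern = build_output_pattern(FIELDS)
output_line = output_pattern % ('Jared', 'JV')
```

Unlike the loop in the patch, this version leaves no trailing comma before the closing brace; both forms are valid Python dictionary literals.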
| 95 self._output_pattern = self._output_pattern[:-1] + "},\n" | |
| 96 | |
| 97 self._input_filename = input_filename | |
|
dennisjeffrey
2011/02/11 00:53:17
We should probably check to ensure that input_file
dyu1
2011/02/16 03:17:31
Done.
| |
| 98 self._output_filename = output_filename | |
| 99 self._display_nothing = display_nothing | |
| 100 self._display_input_lines = display_input_lines | |
| 101 self._display_converted_lines = display_converted_lines | |
| 102 self._record_length = len(self._fields) | |
|
dennisjeffrey
2011/02/11 00:53:17
Perhaps we could remove this variable and just rep
dyu1
2011/02/16 03:17:31
Done.
| |
| 103 | |
| 104 def CreateDictionaryFromRecord(self, line): | |
|
dennisjeffrey
2011/02/11 00:53:17
If this function is only used by the _Convert() fu
dyu1
2011/02/16 03:17:31
Done.
| |
| 105 """Constructs and returns a dictionary from a record in the dataset file. | |
| 106 Escapes single quotation first and uses split('|') to separate values. | |
|
dennisjeffrey
2011/02/11 00:53:17
This first line of the comment should be a 1-line
dyu1
2011/02/16 03:17:31
Done.
| |
| 107 | |
| 108 Example: | |
| 109 Takes a string argument such as u'John|Doe|Mountain View' | |
| 110 and returns a dictionary | |
| 111 { | |
| 112 u'NAME_FIRST': u'John', | |
| 113 u'NAME_LAST': u'Doe', | |
| 114 u'ADDRESS_HOME_CITY': u'Mountain View', | |
| 115 } | |
| 116 | |
| 117 Arg: | |
|
dennisjeffrey
2011/02/11 00:53:17
"Arg" --> "Args"
(I think it should be "Args" eve
dyu1
2011/02/16 03:17:31
Done.
| |
| 118 line: row of record from the dataset file. | |
|
dennisjeffrey
2011/02/11 00:53:17
Since this method returns something, you should ha
dyu1
2011/02/16 03:17:31
Done.
| |
| 119 """ | |
| 120 # Ignore irrelevant record lines such as comment lines. | |
|
dennisjeffrey
2011/02/11 00:53:17
Besides comment lines, what other lines are consid
dyu1
2011/02/16 03:17:31
Done.
| |
| 121 if not '|' in line: | |
|
dennisjeffrey
2011/02/11 00:53:17
What if a comment contains a "|" character? Then
dyu1
2011/02/16 03:17:31
No, I have a check in place (line 129) where it ch
dennis_jeffrey
2011/02/16 19:43:29
Oh, ok. I didn't realize that each line is expect
| |
| 122 return | |
|
dennisjeffrey
2011/02/11 00:53:17
Is it possible to have a valid line that does not
dyu1
2011/02/16 03:17:31
Well the dataset given to me is in the following f
dennis_jeffrey
2011/02/16 19:43:29
Ok, I see. I was thinking that in general, a reco
| |
| 123 re_pattern = re.compile("'", re.UNICODE) | |
| 124 line = re_pattern.sub(r"\'", line) | |
|
dennisjeffrey
2011/02/11 00:53:17
You might want to add a comment to describe what y
dyu1
2011/02/16 03:17:31
Done.
dennis_jeffrey
2011/02/16 19:43:29
Oops, sorry - Now that I see your comment, I reali
| |
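The substitution on lines 123-124 escapes single quotes so that each value can later be dropped into a `u'%s'` literal without ending the string early. A minimal Python 3 sketch of the same idea, using `str.replace` instead of a compiled regex; the helper name and the sample value "O'Brien" are hypothetical:

```python
import ast


def escape_single_quotes(line):
    """Prefixes every single quote with a backslash."""
    return line.replace("'", "\\'")


escaped = escape_single_quotes("O'Brien")
# Embedding the escaped value in a quoted literal round-trips to the original.
round_tripped = ast.literal_eval("u'%s'" % escaped)
```

This handles only single quotes; a value that itself contains backslashes would need further escaping.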
| 125 | |
| 126 line_list = line.split('|') | |
| 127 if line_list: | |
| 128 # Check for case when a line may have more or less fields than expected. | |
| 129 if len(line_list) != self._record_length: | |
| 130 print >> sys.stderr, "Error: a '|' separated line has %d fields \ | |
| 131 instead of %d" % (len(line_list), self._record_length) | |
| 132 print >> sys.stderr, "\t%s" % line | |
| 133 return | |
|
dennisjeffrey
2011/02/11 00:53:17
How about raising an exception rather than just re
dyu1
2011/02/16 03:17:31
Done for logging.
If I raise an exception here th
dennis_jeffrey
2011/02/16 19:43:29
Ok, I think a logging.warning like what you do now
| |
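The outcome of this thread, logging a warning and skipping the malformed line rather than raising, can be sketched as follows. The function names and the three-field record length are assumptions for illustration, and the code is Python 3.

```python
import logging

# Hypothetical record length; the real script uses len() of its field list.
EXPECTED_FIELD_COUNT = 3


def split_record(line, expected=EXPECTED_FIELD_COUNT):
    """Returns the '|'-separated values, or None if the count is wrong."""
    values = line.split('|')
    if len(values) != expected:
        logging.warning("Skipping line with %d fields instead of %d: %s",
                        len(values), expected, line)
        return None
    return values


def split_all(lines):
    """Converts every well-formed line and keeps going past bad ones."""
    records = []
    for line in lines:
        values = split_record(line)
        if values is not None:
            records.append(values)
    return records
```

Because a bad line only produces a warning, one malformed record does not abort the conversion of the rest of the dataset, which raising an exception inside the loop would do unless the caller caught it.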
| 134 out_record = {} | |
| 135 i = 0 | |
| 136 for key in self._fields: | |
| 137 out_record[key] = line_list[i] | |
| 138 i += 1 | |
|
dennisjeffrey
2011/02/11 00:53:17
It looks like here, you're assuming that the order
dyu1
2011/02/16 03:17:31
Yes, since the order of the keys from the order in
| |
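Since the values are expected in the same order as the field list, the manual index counter on lines 135-138 can be replaced by pairing the two sequences with `zip`. A minimal Python 3 sketch, using the three fields from the docstring example; `record_to_dict` is a hypothetical name:

```python
# Field order must match the column order of the dataset file.
FIELDS = ['NAME_FIRST', 'NAME_LAST', 'ADDRESS_HOME_CITY']


def record_to_dict(line, fields=FIELDS):
    """Pairs each '|'-separated value with the field at the same position."""
    values = line.split('|')
    if len(values) != len(fields):
        return None
    return dict(zip(fields, values))


profile = record_to_dict('John|Doe|Mountain View')
```

The length check matters here: `zip` silently stops at the shorter sequence, so without it a short line would yield a dictionary with missing keys.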
| 139 return out_record | |
| 140 | |
| 141 def _Convert(self, input_file, output_file): | |
| 142 """The real conversion takes place here. | |
|
dennisjeffrey
2011/02/11 00:53:17
I think it would be more useful to say what's bein
dyu1
2011/02/16 03:17:31
Done.
| |
| 143 | |
| 144 Args: | |
| 145 input_file: dataset input file. | |
| 146 output_file: the converted dictionary list output file. | |
|
dennisjeffrey
2011/02/11 00:53:17
Since this function returns something, you need a
dyu1
2011/02/16 03:17:31
Done.
| |
| 147 """ | |
| 148 list_of_dict = [] | |
| 149 i = 0 | |
| 150 if output_file: | |
| 151 output_file.write("[") | |
| 152 output_file.write(os.linesep) | |
| 153 for line in input_file.readlines(): | |
| 154 line = line.strip() | |
| 155 if not line: | |
| 156 continue | |
| 157 line = unicode(line, 'UTF-8') | |
| 158 output_record = self.CreateDictionaryFromRecord(line) | |
| 159 if output_record: | |
| 160 i += 1 | |
| 161 list_of_dict.append(output_record) | |
| 162 output_line = self._output_pattern %tuple( | |
|
dennisjeffrey
2011/02/11 00:53:17
Put a space after the "%".
dyu1
2011/02/16 03:17:31
Done.
| |
| 163 [output_record[key] for key in self._fields]) | |
| 164 if output_file: | |
| 165 output_file.write(output_line) | |
| 166 output_file.write(os.linesep) | |
| 167 if not self._display_nothing: | |
| 168 if self._display_input_lines: | |
| 169 print "\n%d: %s" %(i, line.encode(sys.stdout.encoding, 'ignore')) | |
|
dennisjeffrey
2011/02/11 00:53:17
Put a space after the "%".
dyu1
2011/02/16 03:17:31
Done.
| |
| 170 if self._display_converted_lines: | |
| 171 print "\tconverted to: %s" %output_line.encode( | |
|
dennisjeffrey
2011/02/11 00:53:17
You may want to consider using the "logging" modul
dennisjeffrey
2011/02/11 00:53:17
Put a space after the "%".
dyu1
2011/02/16 03:17:31
Done.
| |
| 172 sys.stdout.encoding, 'ignore') | |
| 173 else: | |
| 174 if not self._display_input_lines and not i % 10: | |
| 175 print "\t%d lines converted so far!" %i | |
|
dennisjeffrey
2011/02/11 00:53:17
Put a space after the "%".
dennisjeffrey
2011/02/11 00:53:17
I assume all lines should be converted nearly inst
| |
| 176 if output_file: | |
| 177 output_file.write("]") | |
| 178 output_file.write(os.linesep) | |
| 179 if not self._display_nothing: | |
| 180 print | |
| 181 print "%d lines converted SUCCESSFULLY!" %i | |
|
dennisjeffrey
2011/02/11 00:53:17
Put a space after the "%".
dyu1
2011/02/16 03:17:31
Done.
| |
| 182 print "--- FINISHED ---" | |
| 183 print | |
|
dennisjeffrey
2011/02/11 00:53:17
Again, consider using "logging" instead of "print"
dyu1
2011/02/16 03:17:31
Done.
| |
| 184 return list_of_dict | |
| 185 | |
| 186 def Convert(self): | |
| 187 """Takes arguments of two file names and creates two file objects, then | |
|
dennisjeffrey
2011/02/11 00:53:17
This method actually doesn't take any parameter ar
dyu1
2011/02/16 03:17:31
Done.
| |
| 188 calls _Convert() with these two file objects to do the real conversion.""" | |
|
dennisjeffrey
2011/02/11 00:53:17
The first comment line should be a 1-line summary
dyu1
2011/02/16 03:17:31
Done.
| |
| 189 with open(self._input_filename) as input_file: | |
| 190 if self._output_filename: | |
| 191 with codecs.open(self._output_filename, mode = 'wb', | |
| 192 encoding = 'utf-8-sig') as output_file: | |
|
dennisjeffrey
2011/02/11 00:53:17
Remove the spaces around the "=" when specifying t
dyu1
2011/02/16 03:17:31
Done.
| |
| 193 return self._Convert(input_file, output_file) | |
| 194 else: | |
| 195 return self._Convert(input_file, None) | |
| 196 | |
|
dennisjeffrey
2011/02/11 00:53:17
Should have an extra blank line here: the style gu
dyu1
2011/02/16 03:17:31
Done.
| |
| 197 def main(): | |
| 198 c = DatasetConverter(r'../data/autofill/dataset.txt', | |
|
dennisjeffrey
2011/02/11 00:53:17
Is it better to hard-code the input filename and o
dyu1
2011/02/16 03:17:31
Well command-line input would be fine for the stan
dennis_jeffrey
2011/02/16 19:43:29
When this module is invoked via the PyAuto test, t
| |
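One way to reconcile the two positions in this thread is to accept filenames on the command line while keeping the current paths as defaults. The sketch below assumes Python 3 and `argparse`; the flag names `--input` and `--output` and the `parse_args` helper are hypothetical and not part of the patch.

```python
import argparse

# Defaults taken from the hard-coded paths in main().
DEFAULT_INPUT = '../data/autofill/dataset.txt'
DEFAULT_OUTPUT = '../data/autofill/dataset_duplicate-profiles.txt'


def parse_args(argv=None):
    """Parses optional input/output filenames, falling back to the defaults."""
    parser = argparse.ArgumentParser(
        description='Convert an Autofill profile dataset to a dictionary list.')
    parser.add_argument('--input', default=DEFAULT_INPUT,
                        help='path of the input dataset')
    parser.add_argument('--output', default=DEFAULT_OUTPUT,
                        help='path of the converted file')
    return parser.parse_args(argv)


options = parse_args([])
```

A test that imports the module can still construct `DatasetConverter` directly with its own paths, so the command-line layer only affects standalone runs.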
| 199 r'../data/autofill/dataset_duplicate-profiles.txt') | |
|
dennisjeffrey
2011/02/11 00:53:17
The second argument should line up underneath the
dyu1
2011/02/16 03:17:31
Done.
| |
| 200 c.Convert() | |
| 201 | |
| 202 if __name__ == '__main__': | |
| 203 main() | |