Chromium Code Reviews
| Index: chrome/test/functional/dataset_converter.py |
| =================================================================== |
| --- chrome/test/functional/dataset_converter.py (revision 0) |
| +++ chrome/test/functional/dataset_converter.py (revision 0) |
| @@ -0,0 +1,203 @@ |
| +#!/usr/bin/python |
| +# Copyright (c) 2011 The Chromium Authors. All rights reserved. |
| +# Use of this source code is governed by a BSD-style license that can be |
| +# found in the LICENSE file. |
| + |
| +"""Takes in a dataset profiles file and outputs to a dictionary list format |
|
dennisjeffrey
2011/02/11 00:53:17
The first line of this comment should be a 1-line
dyu1
2011/02/16 03:17:31
Done.
|
| +for converting Autofill profile datasets. |
| + |
| +Used for test autofill.AutoFillTest.testMergeDuplicateProfilesInAutofill. |
| +""" |
| + |
| +import re |
| +import codecs |
| +import sys |
| +import os |
|
dennisjeffrey
2011/02/11 00:53:17
These should be specified in alphabetical order.
dyu1
2011/02/16 03:17:31
Done.
|
| + |
| + |
| +class DatasetConverter(object): |
| + def __init__(self, input_filename, output_filename = None, |
| + display_nothing = True, display_input_lines = False, |
| + display_converted_lines = False): |
|
dennisjeffrey
2011/02/11 00:53:17
Don't put spaces around the "=" when you're defini
dennisjeffrey
2011/02/11 00:53:17
Using the "logging" module with different verbosit
dyu1
2011/02/16 03:17:31
Done.
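A minimal sketch of how the two suggestions above might look together: keyword defaults written without spaces around "=", and the three display_* flags replaced by a single logging verbosity level. The logging_level parameter, the logger name and the ReportLine helper are assumptions for illustration and are not taken from the patch. The sketch targets Python 3 rather than the Python 2 used in the file under review.

```python
import logging


class DatasetConverter(object):
  def __init__(self, input_filename, output_filename=None,
               logging_level=logging.ERROR):
    """Constructs a dataset converter object.

    Args:
      input_filename: name and path of the input dataset.
      output_filename: name and path of the converted file, default is None.
      logging_level: hypothetical replacement for the display_* flags.
    """
    self._input_filename = input_filename
    self._output_filename = output_filename
    # A named logger lets callers tune this module's verbosity separately
    # from the root logger.
    self._logger = logging.getLogger('dataset_converter')
    self._logger.setLevel(logging_level)

  def ReportLine(self, index, line, converted):
    """Logs an input line and its converted form at DEBUG verbosity."""
    self._logger.debug('%d: %s', index, line)
    self._logger.debug('\tconverted to: %s', converted)


converter = DatasetConverter('dataset.txt', logging_level=logging.DEBUG)
```

With this shape, a caller that wants the old "display nothing" behaviour would simply keep the default level.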
|
| + """Constructs a dataset converter object. |
| + |
| + Full input pattern: |
| + '(?P<NAME_FIRST>.*?)\|(?P<NAME_MIDDLE>.*?)\|(?P<NAME_LAST>.*?)\| |
| + (?P<EMAIL_ADDRESS>.*?)\|(?P<COMPANY_NAME>.*?)\|(?P<ADDRESS_HOME_LINE1>.*?) |
| + \|(?P<ADDRESS_HOME_LINE2>.*?)\|(?P<ADDRESS_HOME_CITY>.*?)\| |
| + (?P<ADDRESS_HOME_STATE>.*?)\|(?P<ADDRESS_HOME_ZIP>.*?)\| |
| + (?P<ADDRESS_HOME_COUNTRY>.*?)\| |
| + (?P<PHONE_HOME_WHOLE_NUMBER>.*?)\|(?P<PHONE_FAX_WHOLE_NUMBER>.*?)$' |
| + |
| + Full output pattern: |
| + "{u'NAME_FIRST': u'%s', u'NAME_MIDDLE': u'%s', u'NAME_LAST': u'%s', |
| + u'EMAIL_ADDRESS': u'%s', u'COMPANY_NAME': u'%s', u'ADDRESS_HOME_LINE1': |
| + u'%s', u'ADDRESS_HOME_LINE2': u'%s', u'ADDRESS_HOME_CITY': u'%s', |
| + u'ADDRESS_HOME_STATE': u'%s', u'ADDRESS_HOME_ZIP': u'%s', |
| + u'ADDRESS_HOME_COUNTRY': u'%s', u'PHONE_HOME_WHOLE_NUMBER': u'%s', |
| + u'PHONE_FAX_WHOLE_NUMBER': u'%s',}," |
| + |
| + The pattern is a regular expression which has named parenthesis groups |
|
Nirnimesh
2011/02/11 19:39:54
I think the input/output pattern above is illustra
dyu1
2011/02/16 03:17:31
Done.
|
| + like this (?P<name>...) in order to match the '|' separated fields. |
| + If we had only the NAME_FIRST and NAME_MIDDLE fields (e.g. 'Jared|JV') our |
| + pattern would be: "(?P<NAME_FIRST>.*?)\|(?P<NAME_MIDDLE>.*?)$" |
| + |
| + This means that '(?P<NAME_FIRST> regexp)\|' matches whatever regular |
| + expression is inside the parentheses, and indicates the start and end of a |
| + group; the contents of a group can be retrieved after a match has been |
| + performed using the symbolic group name 'NAME_FIRST'. |
| + |
| + The regexp is '.*?'. The '.*' part matches 0 or more repetitions of any |
| + character. The trailing '?' makes the match non-greedy, meaning it stops |
| + at the first occurrence of the '|' character (escaped in the pattern). |
| + |
| + For '(?P<NAME_MIDDLE>.*?)$' there is no '|' at the end, so we have '$' to |
| + indicate the end of the line. |
| + |
| + The full pattern is constructed once from the FIELDS list. |
| + |
| + The out_line_pattern for one field: "{u'NAME_FIRST': u'%s'," |
| + is ready to accept the value for the 'NAME_FIRST' field once it is extracted |
| + from an input line using the above group pattern. |
| + |
| + 'pattern' is used in CreateDictionaryFromRecord(line) to construct and |
| + return a dictionary from a line. |
| + |
| + 'out_line_pattern' is used in 'convert()' to construct the final dataset |
| + line that will be printed to the output file. |
| + |
| + Args: |
| + input_filename: name and path of the input dataset. |
| + output_filename: name and path of the converted file, default is None. |
| + display_nothing: output display on the screen, default is True. |
| + display_input_lines: output display of the input file, default is False. |
| + display_converted_lines: output display of the converted file, |
| + default is False. |
| + """ |
| + self._fields = [ |
| + u'NAME_FIRST', |
| + u'NAME_MIDDLE', |
| + u'NAME_LAST', |
| + u'EMAIL_ADDRESS', |
| + u'COMPANY_NAME', |
| + u'ADDRESS_HOME_LINE1', |
| + u'ADDRESS_HOME_LINE2', |
| + u'ADDRESS_HOME_CITY', |
| + u'ADDRESS_HOME_STATE', |
| + u'ADDRESS_HOME_ZIP', |
| + u'ADDRESS_HOME_COUNTRY', |
| + u'PHONE_HOME_WHOLE_NUMBER', |
| + u'PHONE_FAX_WHOLE_NUMBER', |
| + ] |
|
dennisjeffrey
2011/02/11 00:53:17
Since _fields is just a constant array, would it b
dyu1
2011/02/16 03:17:31
Done.
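A sketch of how a module-level constant could drive the named-group input pattern described in the docstring. The names FIELDS and INPUT_PATTERN are assumptions, the field list is shortened to the two-field 'Jared|JV' example from the docstring, and the code is written for Python 3.

```python
import re

# Hypothetical module-level constant replacing self._fields (shortened).
FIELDS = [
    'NAME_FIRST',
    'NAME_MIDDLE',
]

# One non-greedy named group per field, joined by an escaped '|' and
# anchored at the end of the line with '$'.
INPUT_PATTERN = re.compile(
    r'\|'.join('(?P<%s>.*?)' % field for field in FIELDS) + '$')

match = INPUT_PATTERN.match('Jared|JV')
# groupdict() returns the captured values keyed by the symbolic group names.
record = match.groupdict()
```

Because the pattern is derived from FIELDS, adding a field to the list extends the pattern without any further edits.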
|
| + self._output_pattern = u"{" |
|
Nirnimesh
2011/02/11 19:39:54
prefer single quote char '
dyu1
2011/02/16 03:17:31
Done.
|
| + for key in self._fields: |
| + self._output_pattern += u"u'%s': u'%s', " %(key, "%s") |
|
dennisjeffrey
2011/02/11 00:53:17
I think this could be re-written like this:
self.
dyu1
2011/02/16 03:17:31
Done.
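One way the loop could be collapsed into a single expression, shown only as a sketch. It is an assumption and not necessarily the rewrite proposed in the comment above. FIELDS and OUTPUT_PATTERN are hypothetical names, the field list is shortened, and the result follows the "Full output pattern" form from the docstring without the trailing newline.

```python
FIELDS = ['NAME_FIRST', 'NAME_MIDDLE', 'NAME_LAST']  # shortened for illustration

# Each field contributes "u'KEY': u'%s'". The doubled %% leaves a literal %s
# behind, to be filled in later with the field value.
OUTPUT_PATTERN = u'{%s,},' % u', '.join(
    u"u'%s': u'%%s'" % key for key in FIELDS)

# Filling the pattern with one value per field, in FIELDS order.
line = OUTPUT_PATTERN % ('John', '', 'Doe')
```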
|
| + self._output_pattern = self._output_pattern[:-1] + "},\n" |
| + |
| + self._input_filename = input_filename |
|
dennisjeffrey
2011/02/11 00:53:17
We should probably check to ensure that input_file
dyu1
2011/02/16 03:17:31
Done.
|
| + self._output_filename = output_filename |
| + self._display_nothing = display_nothing |
| + self._display_input_lines = display_input_lines |
| + self._display_converted_lines = display_converted_lines |
| + self._record_length = len(self._fields) |
|
dennisjeffrey
2011/02/11 00:53:17
Perhaps we could remove this variable and just rep
dyu1
2011/02/16 03:17:31
Done.
|
| + |
| + def CreateDictionaryFromRecord(self, line): |
|
dennisjeffrey
2011/02/11 00:53:17
If this function is only used by the _Convert() fu
dyu1
2011/02/16 03:17:31
Done.
|
| + """Constructs and returns a dictionary from a record in the dataset file. |
| + Escapes single quotation first and uses split('|') to separate values. |
|
dennisjeffrey
2011/02/11 00:53:17
This first line of the comment should be a 1-line
dyu1
2011/02/16 03:17:31
Done.
|
| + |
| + Example: |
| + Takes a string argument such as u'John|Doe|Mountain View' |
| + and returns a dictionary |
| + { |
| + u'NAME_FIRST': u'John', |
| + u'NAME_LAST': u'Doe', |
| + u'ADDRESS_HOME_CITY': u'Mountain View', |
| + } |
| + |
| + Arg: |
|
dennisjeffrey
2011/02/11 00:53:17
"Arg" --> "Args"
(I think it should be "Args" eve
dyu1
2011/02/16 03:17:31
Done.
|
| + line: row of record from the dataset file. |
|
dennisjeffrey
2011/02/11 00:53:17
Since this method returns something, you should ha
dyu1
2011/02/16 03:17:31
Done.
|
| + """ |
| + # Ignore irrelevant record lines such as comment lines. |
|
dennisjeffrey
2011/02/11 00:53:17
Besides comment lines, what other lines are consid
dyu1
2011/02/16 03:17:31
Done.
|
| + if not '|' in line: |
|
dennisjeffrey
2011/02/11 00:53:17
What if a comment contains a "|" character? Then
dyu1
2011/02/16 03:17:31
No, I have a check in place (line 129) where it ch
dennis_jeffrey
2011/02/16 19:43:29
Oh, ok. I didn't realize that each line is expect
|
| + return |
|
dennisjeffrey
2011/02/11 00:53:17
Is it possible to have a valid line that does not
dyu1
2011/02/16 03:17:31
Well the dataset given to me is in the following f
dennis_jeffrey
2011/02/16 19:43:29
Ok, I see. I was thinking that in general, a reco
|
| + re_pattern = re.compile("'", re.UNICODE) |
| + line = re_pattern.sub(r"\'", line) |
|
dennisjeffrey
2011/02/11 00:53:17
You might want to add a comment to describe what y
dyu1
2011/02/16 03:17:31
Done.
dennis_jeffrey
2011/02/16 19:43:29
Oops, sorry - Now that I see your comment, I reali
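For reference, a small sketch of what the substitution in the patch does to a record. Each value is later wrapped in u'...' in the output line, so an unescaped apostrophe would end the string literal early. The sample record is hypothetical and the code is written for Python 3.

```python
import re

line = u"Mary|O'Brien"  # hypothetical record containing an apostrophe

# Same substitution as in the patch: put a backslash before every single quote.
escaped = re.compile("'", re.UNICODE).sub(r"\'", line)

# For a fixed one-character pattern, str.replace gives the same result.
escaped_simple = line.replace("'", "\\'")
```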
|
| + |
| + line_list = line.split('|') |
| + if line_list: |
| + # Check for case when a line may have more or less fields than expected. |
| + if len(line_list) != self._record_length: |
| + print >> sys.stderr, "Error: a '|' separated line has %d fields \ |
| + instead of %d" % (len(line_list), self._record_length) |
| + print >> sys.stderr, "\t%s" % line |
| + return |
|
dennisjeffrey
2011/02/11 00:53:17
How about raising an exception rather than just re
dyu1
2011/02/16 03:17:31
Done for logging.
If I raise an exception here th
dennis_jeffrey
2011/02/16 19:43:29
Ok, I think a logging.warning like what you do now
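A sketch of the warn-and-skip behaviour agreed on here, presumably chosen so that one malformed record does not abort the whole conversion. SplitRecord and FIELDS are hypothetical names, the field list is shortened, and the code is written for Python 3.

```python
import logging

FIELDS = ['NAME_FIRST', 'NAME_MIDDLE', 'NAME_LAST']  # shortened for illustration


def SplitRecord(line):
  """Splits a '|' separated record into values.

  Args:
    line: row of record from the dataset file.

  Returns:
    A list with one value per field, or None if the field count is wrong.
  """
  values = line.split('|')
  if len(values) != len(FIELDS):
    # Warn and skip rather than raise, so later records are still converted.
    logging.warning("a '|' separated line has %d fields instead of %d: %s",
                    len(values), len(FIELDS), line)
    return None
  return values
```

A caller would then skip any record for which SplitRecord returns None and carry on with the next line.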
|
| + out_record = {} |
| + i = 0 |
| + for key in self._fields: |
| + out_record[key] = line_list[i] |
| + i += 1 |
|
dennisjeffrey
2011/02/11 00:53:17
It looks like here, you're assuming that the order
dyu1
2011/02/16 03:17:31
Yes, since the order of the keys from the order in
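The positional pairing discussed here can also be written with zip, which makes the ordering assumption explicit: the n-th value in the line belongs to the n-th entry of the field list. This is a sketch for Python 3 using the three-field example from the docstring; FIELDS is a hypothetical name.

```python
FIELDS = ['NAME_FIRST', 'NAME_LAST', 'ADDRESS_HOME_CITY']  # docstring example

line_list = 'John|Doe|Mountain View'.split('|')

# zip pairs each key with the value at the same position, replacing the
# manual index counter.
out_record = dict(zip(FIELDS, line_list))
```

Note that zip silently truncates to the shorter sequence, so the field-count check still has to run first.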
|
| + return out_record |
| + |
| + def _Convert(self, input_file, output_file): |
| + """The real conversion takes place here. |
|
dennisjeffrey
2011/02/11 00:53:17
I think it would be more useful to say what's bein
dyu1
2011/02/16 03:17:31
Done.
|
| + |
| + Args: |
| + input_file: dataset input file. |
| + output_file: the converted dictionary list output file. |
|
dennisjeffrey
2011/02/11 00:53:17
Since this function returns something, you need a
dyu1
2011/02/16 03:17:31
Done.
|
| + """ |
| + list_of_dict = [] |
| + i = 0 |
| + if output_file: |
| + output_file.write("[") |
| + output_file.write(os.linesep) |
| + for line in input_file.readlines(): |
| + line = line.strip() |
| + if not line: |
| + continue |
| + line = unicode(line, 'UTF-8') |
| + output_record = self.CreateDictionaryFromRecord(line) |
| + if output_record: |
| + i += 1 |
| + list_of_dict.append(output_record) |
| + output_line = self._output_pattern %tuple( |
|
dennisjeffrey
2011/02/11 00:53:17
Put a space after the "%".
dyu1
2011/02/16 03:17:31
Done.
|
| + [output_record[key] for key in self._fields]) |
| + if output_file: |
| + output_file.write(output_line) |
| + output_file.write(os.linesep) |
| + if not self._display_nothing: |
| + if self._display_input_lines: |
| + print "\n%d: %s" %(i, line.encode(sys.stdout.encoding, 'ignore')) |
|
dennisjeffrey
2011/02/11 00:53:17
Put a space after the "%".
dyu1
2011/02/16 03:17:31
Done.
|
| + if self._display_converted_lines: |
| + print "\tconverted to: %s" %output_line.encode( |
|
dennisjeffrey
2011/02/11 00:53:17
You may want to consider using the "logging" modul
dennisjeffrey
2011/02/11 00:53:17
Put a space after the "%".
dyu1
2011/02/16 03:17:31
Done.
|
| + sys.stdout.encoding, 'ignore') |
| + else: |
| + if not self._display_input_lines and not i % 10: |
| + print "\t%d lines converted so far!" %i |
|
dennisjeffrey
2011/02/11 00:53:17
Put a space after the "%".
dennisjeffrey
2011/02/11 00:53:17
I assume all lines should be converted nearly inst
|
| + if output_file: |
| + output_file.write("]") |
| + output_file.write(os.linesep) |
| + if not self._display_nothing: |
| + print "%d lines converted SUCCESSFULLY!" %i |
|
dennisjeffrey
2011/02/11 00:53:17
Put a space after the "%".
dyu1
2011/02/16 03:17:31
Done.
|
| + print "--- FINISHED ---" |
|
dennisjeffrey
2011/02/11 00:53:17
Again, consider using "logging" instead of "print"
dyu1
2011/02/16 03:17:31
Done.
|
| + return list_of_dict |
| + |
| + def Convert(self): |
| + """Takes arguments of two file names and creates two file objects, then |
|
dennisjeffrey
2011/02/11 00:53:17
This method actually doesn't take any parameter ar
dyu1
2011/02/16 03:17:31
Done.
|
| + calls _Convert() with these two file objects to do the real conversion.""" |
|
dennisjeffrey
2011/02/11 00:53:17
The first comment line should be a 1-line summary
dyu1
2011/02/16 03:17:31
Done.
|
| + with open(self._input_filename) as input_file: |
| + if self._output_filename: |
| + with codecs.open(self._output_filename, mode = 'wb', |
| + encoding = 'utf-8-sig') as output_file: |
|
dennisjeffrey
2011/02/11 00:53:17
Remove the spaces around the "=" when specifying t
dyu1
2011/02/16 03:17:31
Done.
|
| + return self._Convert(input_file, output_file) |
| + else: |
| + return self._Convert(input_file, None) |
| + |
|
dennisjeffrey
2011/02/11 00:53:17
Should have an extra blank line here: the style gu
dyu1
2011/02/16 03:17:31
Done.
|
| +def main(): |
| + c = DatasetConverter(r'../data/autofill/dataset.txt', |
|
dennisjeffrey
2011/02/11 00:53:17
Is it better to hard-code the input filename and o
dyu1
2011/02/16 03:17:31
Well, command-line input would be fine for the stan
dennis_jeffrey
2011/02/16 19:43:29
When this module is invoked via the PyAuto test, t
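A possible middle ground, sketched under assumptions: accept the filenames on the command line for standalone use, and fall back to the paths currently hard-coded in main(). ParseArgs and the argument names are hypothetical, and the code is written for Python 3.

```python
import argparse


def ParseArgs(argv):
  """Parses hypothetical command-line arguments for the converter.

  Args:
    argv: list of argument strings, excluding the program name.

  Returns:
    A namespace with input_filename and output_filename attributes.
  """
  parser = argparse.ArgumentParser(
      description='Converts an Autofill profile dataset.')
  # Both positionals are optional and default to the paths used in main().
  parser.add_argument('input_filename', nargs='?',
                      default='../data/autofill/dataset.txt')
  parser.add_argument('output_filename', nargs='?',
                      default='../data/autofill/dataset_duplicate-profiles.txt')
  return parser.parse_args(argv)


args = ParseArgs(['my_dataset.txt'])
```

Code that imports the module can still construct DatasetConverter directly and never go through ParseArgs.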
|
| + r'../data/autofill/dataset_duplicate-profiles.txt') |
|
dennisjeffrey
2011/02/11 00:53:17
The second argument should line up underneath the
dyu1
2011/02/16 03:17:31
Done.
|
| + c.Convert() |
| + |
| +if __name__ == '__main__': |
| + main() |