Licence CC BY-NC-ND, Thierry Parmentelat
This activity is about parsing text files, and building structures using builtin types.
read a file¶
works on: list, file, tuple
the input: the file contains lines like
first_name last_name email phone
fields are separated by any number (but at least one) of spaces/tabs, like e.g. in file people-simple-03
todo: write a function for parsing this format; it should return a list of 4-tuples
def parse(filename) -> list[tuple[str, str, str, str]]: ...
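One possible way to write this function is sketched here; it is not the only valid answer. It relies on `str.split()` called without arguments, which splits on any run of whitespace (spaces and tabs alike). The UTF-8 encoding, the skipping of blank lines, and the sample content written to a temporary file are assumptions made for the demo; they do not come from the course data files.

```python
import os
import tempfile


def parse(filename) -> list[tuple[str, str, str, str]]:
    """Return one (first_name, last_name, email, phone) tuple per line."""
    records = []
    # assumption: the input file is encoded in UTF-8
    with open(filename, encoding="utf-8") as feed:
        for line in feed:
            # no argument: split on any run of spaces/tabs,
            # and ignore leading/trailing whitespace (newline included)
            fields = line.split()
            if not fields:
                # assumption: blank lines are silently skipped
                continue
            # raises ValueError if the line does not have exactly 4 fields
            first_name, last_name, email, phone = fields
            records.append((first_name, last_name, email, phone))
    return records


# demo on a throwaway file with made-up content
sample = (
    "John  Doe\tjohn.doe@example.com   0123456789\n"
    "\n"
    "Ann Lee ann@example.com\t+33123456789\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")
    with open(path, "w", encoding="utf-8") as out:
        out.write(sample)
    people = parse(path)

for person in people:
    print(person)
```

Letting the unpacking fail on a line with the wrong number of fields is a deliberate choice in this sketch: a malformed line is reported immediately instead of producing a shorter or longer tuple.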
indexing¶
works on: hash-based types, comprehensions
what we need: a fast way to
check whether an email is in the file
quickly retrieve the details that go with a given email
question: what is the right data structure to implement that ?
todo:
write a function
def index(list_of_tuples):
that builds and returns that data structure
write a function
def initial(list_of_tuples):
that indexes the data on the initial of the first name (what changes do we need to make to the resulting data structure?)
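A possible answer, given as a sketch: a dictionary keyed on the email offers both membership tests and lookups in constant time on average, which a plain list cannot do. When indexing on the initial, several people may share the same key, so each value has to become a collection of tuples; a list is used here, but that choice is an assumption (a set would work as well). The sample data is made up.

```python
def index(list_of_tuples):
    """Map each email to the full tuple it belongs to."""
    # dict comprehension; assumes emails are unique in the input,
    # otherwise a later line overwrites an earlier one
    return {record[2]: record for record in list_of_tuples}


def initial(list_of_tuples):
    """Map each first-name initial to the list of matching tuples."""
    result = {}
    for record in list_of_tuples:
        first_letter = record[0][0]
        # several people can share an initial: store a list per key
        result.setdefault(first_letter, []).append(record)
    return result


# made-up data in the shape returned by parse()
people = [
    ("John", "Doe", "john.doe@example.com", "0123456789"),
    ("Jane", "Roe", "jane.roe@example.com", "0987654321"),
    ("Ann", "Lee", "ann@example.com", "+33123456789"),
]

by_email = index(people)
print("ann@example.com" in by_email)       # fast membership test
print(by_email["jane.roe@example.com"])    # fast retrieval

by_initial = initial(people)
print(by_initial["J"])                     # both John and Jane
```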
dataframe (optional)¶
works on: dataframes
todo: build a pandas dataframe to hold all the data
tip: see the documentation of pd.DataFrame() and observe that there are multiple interfaces to build a dataframe
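As an illustration of two of these interfaces, the sketch below builds the same dataframe once from a list of tuples (one tuple per row) and once from a dictionary of columns. The column names and the sample data are assumptions chosen for the demo.

```python
import pandas as pd

# made-up data in the shape returned by parse()
people = [
    ("John", "Doe", "john.doe@example.com", "0123456789"),
    ("Ann", "Lee", "ann@example.com", "+33123456789"),
]
columns = ["first_name", "last_name", "email", "phone"]

# interface 1: a list of rows, plus the column names
df = pd.DataFrame(people, columns=columns)

# interface 2: a dictionary mapping each column name to its values;
# zip(*people) transposes the rows into columns
df_from_dict = pd.DataFrame(
    {name: list(values) for name, values in zip(columns, zip(*people))}
)

# using the email as the index gives label-based lookups with .loc
df_by_email = df.set_index("email")
print(df_by_email.loc["ann@example.com", "last_name"])
```

Phone numbers are kept as strings on purpose: converting them to integers would lose the leading `0` or the `+33` prefix.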
groups¶
works on: set
a more elaborate input: the file now contains optional fields
first_name last_name email phone [group1 .. groupn]
where the part between [] is optional, i.e. there can be 0 or more group names mentioned on each student line; like e.g. from people-groups-10
todo: duplicate and tweak the parse function, so as to write
def group_parse(filename):
so it now returns a 2-tuple with
the list of tuples as before
a dictionary of sets
the keys here will be the group names,
and the corresponding value is a set of tuples corresponding to the students in that group
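A sketch of one way to do it, using starred unpacking to collect the optional trailing fields. It assumes that the `[]` in the format description only denote optionality and do not appear literally in the file, and that the file is encoded in UTF-8; the sample content and group names are made up.

```python
import os
import tempfile


def group_parse(filename):
    """Return (list of 4-tuples, dict mapping group name -> set of 4-tuples)."""
    people = []
    groups = {}
    # assumption: the input file is encoded in UTF-8
    with open(filename, encoding="utf-8") as feed:
        for line in feed:
            fields = line.split()
            if not fields:
                continue
            # the first 4 fields are mandatory (ValueError if missing);
            # whatever follows, possibly nothing, lands in group_names
            first_name, last_name, email, phone, *group_names = fields
            person = (first_name, last_name, email, phone)
            people.append(person)
            for name in group_names:
                # tuples are hashable, so they can be stored in a set
                groups.setdefault(name, set()).add(person)
    return people, groups


# demo on a throwaway file with made-up content
sample = (
    "John Doe john.doe@example.com 0123456789 python maths\n"
    "Ann Lee ann@example.com +33123456789\n"
    "Jane Roe jane.roe@example.com 0987654321 python\n"
)
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample-groups.txt")
    with open(path, "w", encoding="utf-8") as out:
        out.write(sample)
    people, groups = group_parse(path)

for name, members in groups.items():
    print(name, len(members))
```

Because each student is stored as a tuple, the same object can appear in several sets, and a student with no group simply appears in none of them.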
regexps (optional)¶
works on: regexps
what we need: be able to check the format for the input file:
first_name and last_name may contain letters and - and _
email may contain letters, numbers, dots (.), hyphens (-) and must contain exactly one @
phone numbers may contain 10 digits, or +33 followed by 9 digits
todo: write a function
def check_values(L: list[tuple]) -> None:
that expects as an input the output of parse, and that outlines ill-formed input
note on ASCII vs Unicode input:
in a first approximation, use patterns like a-z to check for letters; how does this behave with respect to names with accents and cedillas?
then play with \w to see if you can overcome this problem
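The rules above can be sketched with `re.fullmatch`-style checks, as one possible approach among others. Several points are assumptions of this sketch, not requirements from the exercise: the helper `ill_formed_fields` and the `UNICODE_NAME` pattern are hypothetical names, the email pattern requires at least one character on each side of the `@`, and "outlining" ill-formed input is interpreted as printing a message. The sample records are made up.

```python
import re

# first approximation: ASCII letters, plus - and _
NAME = re.compile(r"[a-zA-Z_-]+")
# exactly one @; assumption: at least one character on each side of it
EMAIL = re.compile(r"[a-zA-Z0-9.-]+@[a-zA-Z0-9.-]+")
# 10 digits, or +33 followed by 9 digits
PHONE = re.compile(r"(?:[0-9]{10}|\+33[0-9]{9})")

# Unicode-aware variant: \w alone also accepts digits, so [^\W\d] is used
# to mean "a word character that is not a digit" (letters and _)
UNICODE_NAME = re.compile(r"(?:[^\W\d]|-)+")

PATTERNS = (
    ("first_name", NAME),
    ("last_name", NAME),
    ("email", EMAIL),
    ("phone", PHONE),
)


def ill_formed_fields(record):
    """Hypothetical helper: names of the fields that break their pattern."""
    return [
        label
        for (label, pattern), value in zip(PATTERNS, record)
        # fullmatch: the whole field must match, not just a prefix
        if not pattern.fullmatch(value)
    ]


def check_values(L: list[tuple]) -> None:
    """Print one message per record that has at least one ill-formed field."""
    for record in L:
        bad = ill_formed_fields(record)
        if bad:
            print(f"ill-formed {', '.join(bad)} in {record}")


# made-up data in the shape returned by parse()
records = [
    ("Jean-Luc", "Picard", "jl.picard@example.com", "0123456789"),
    ("R2D2", "Droid", "r2d2@@example.com", "+3312345"),
    ("Hélène", "Dupont", "helene@example.com", "+33123456789"),
]
check_values(records)
```

With the ASCII-only `NAME` pattern, a first name with accents is reported as ill-formed; swapping `UNICODE_NAME` into `PATTERNS` accepts accented letters and cedillas while still rejecting digits, because in Python 3 `\w` is Unicode-aware by default on `str` patterns.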