Modernizing the Syntax of Regular Expressions

Adam Andersson & Ludwig Hansson

08/06/2020

Faculty of Computing

Blekinge Institute of Technology SE-371 79 Karlskrona Sweden


This thesis is submitted to the Faculty of Computing at Blekinge Institute of Technology in partial fulfillment of the requirements for the bachelor’s degree in software engineering. The thesis is equivalent to 10 weeks of full-time studies.

Contact Information:
Authors:
Adam Andersson, E-mail: adam.m.andersson@gmail.com
Ludwig Hansson, E-mail: ludwig.hansson1@gmail.com

University advisor: Mikael Svahnberg
Department of Software Engineering

Faculty of Computing
Blekinge Institute of Technology
SE-371 79 Karlskrona, Sweden

Internet: www.bth.se
Phone: +46 455 38 50 00
Fax: +46 455 38 50 57


Abstract

Context. Writing and working with regular expressions can be slow and tedious, mainly because of their syntax but also because several different dialects exist, which can easily cause confusion. Although regular expressions have long been widely used for parsing and programming language design, they are now also frequently used for input validation and appear in common applications such as text editors.

Objectives. The main objective of our thesis is to determine whether a regular expression language that is more like the surrounding programming language would increase usability, readability, and maintainability. We then investigate what impact this would have on, e.g., development speed, and what benefits and liabilities a more modernized syntax could introduce.

Methods. Two different methods were used to answer our research questions: exploratory interviews and experiments. The data from the experiments were collected by screen recording and by a program in the environment we provided to the participants.

Results. Interviews with developers who use traditional regular expressions on a regular basis confirm that the syntax is confusing even for developers with a lot of experience. Our results from the experiment indicate that a language more like the surrounding programming language increases both the overall ease of use and development speed.

Conclusions. From this research, we can conclude that a regular expression language that is more like the surrounding programming language does increase usability, readability, and maintainability. We could clearly see that it had a positive effect on the development speed as well.


Acknowledgments

We would like to thank our supervisor Mikael Svahnberg, Associate Professor/Docent at Blekinge Institute of Technology, for all the support and feedback on this thesis.

Special thanks to all the individuals who participated in the experiment, and to the developers who took their time to attend the interviews.


Chapter 1 Introduction

1.1 Background

At first sight, regular expressions (regexes) look intimidating to most people. The main reason is probably their poor readability: to someone who does not know the meaning behind the characters, a regex may look like a sequence of random characters [7]. Reading other people's regexes, and even your own, can be a real challenge, especially when you return to a regex you wrote a while ago.

A regex is a sequence of characters that together define a search pattern. Many people find the syntax extremely unintuitive because it does not resemble any other language. A regex can also be thought of as a special text string that describes what one would like to match, which is why it is considered such a powerful string processing tool [16]. For instance, imagine that you have a list of email addresses and want to find all addresses that end with "@hotmail.com". This could be achieved with the following regex:

[a-zA-Z0-9_\.]{1,64}@hotmail\.com

This regex matches "@hotmail.com" addresses whose local part consists of 1 to 64 characters drawn from the upper- and lowercase letters A-Z, the digits 0-9, underscore, and dot. Although this is a trivial example that would not match all email addresses, it gives a hint of what the traditional regex syntax looks like.
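As a minimal sketch, the pattern above could be applied with Python's re module as follows. The sample addresses are made up for illustration and are not taken from the thesis.

```python
import re

# The regex from the text, written as a raw string so that the
# backslashes reach the regex engine unchanged.
HOTMAIL = re.compile(r"[a-zA-Z0-9_\.]{1,64}@hotmail\.com")

# Hypothetical sample data for illustration only.
addresses = ["anna.svensson@hotmail.com", "bob@example.org", "c_3@hotmail.com"]

# fullmatch requires the whole string to fit the pattern.
hotmail_addresses = [a for a in addresses if HOTMAIL.fullmatch(a)]
print(hotmail_addresses)
```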

Regexes appear in many applications. One common example is a text editor with support for search and replace [12], which is mostly used to replace, for instance, all occurrences of a specific word. Some text editors also support pattern matching, which makes it possible to write a regex inside the editor. Another example is extracting all phone numbers from a log file. In this case, one cannot simply search for a specific string containing a phone number, because that would match only that specific number. Instead, one has to provide a pattern that describes how a phone number is constructed.

For example, to extract all lines containing a Swedish phone number, one would search for phone numbers that begin with the country calling code +46 followed by nine digits.
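A rough sketch of such a search in Python is given below. The log format and the phone numbers are hypothetical, and the pattern only covers the simple "+46 followed by nine digits" form described above.

```python
import re

# "+46" followed by exactly nine digits. The plus sign is a special
# character and must be escaped; (?!\d) rejects longer digit runs.
SWEDISH_PHONE = re.compile(r"\+46\d{9}(?!\d)")

# Hypothetical log content for illustration only.
log_lines = [
    "2020-01-19 user=anna phone=+46701234567 status=ok",
    "2020-01-19 user=bob phone=+4512345678 status=ok",
    "2020-01-19 user=carl status=missing-phone",
]

# Keep every line that contains a matching phone number.
matching_lines = [line for line in log_lines if SWEDISH_PHONE.search(line)]
print(matching_lines)
```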

Front-end developers are also often introduced to regexes when they need to validate input, e.g., to make sure that the content of an email address field has the correct format [14].

It is currently not possible to debug the regexes you write without the use of an external tool such as [3], nor is there any easy way of writing them because of their unintuitive syntax, and there is little to no error handling involved [9]. To overcome these issues, we present Regify.

Regify is a language that can be used as an alternative way of writing regexes. Writing your regexes in Regify makes them much easier to read, write, and debug due to its more intuitive syntax and built-in error handling system.

1.2 Related work

Previous work has been done within the area of trying to increase the readability of regexes and a few have tried to create alternative ways of writing them as well.

Chang and Manning [4] built TOKENSREGEX, a framework for building cascaded regexes over token sequences, so instead of building the pattern from a sequence of characters one can build it from a sequence of tokens. However, Chang and Manning [4] say that the syntax they have defined for the framework is still very similar to the traditional syntax of a regex. In [4] a few examples are shown using the syntax of TOKENSREGEX to write regexes. Two of the examples are shown below in order to demonstrate its syntax.

Match: Picasso is an artist

[ner:PERSON]+ [pos:VBZ] /an?/ /artist|painter/

Match: 50 kilometers

(?:quant [ner:NUMBER]+ ) /km|kilometers?/

Even though there are a few differences in the syntax shown, it is indeed still similar to traditional regexes, and in our opinion the syntax remains unintuitive and hard to work with. Furthermore, the framework attempts to increase readability by supporting definitions of macros that can be used in later regexes, for example:

$UNIT = /km|kilometers?/

(?:quant [ner:NUMBER]+) $UNIT

We have considered a similar approach as well, but have left it for future work, so it is out of scope here.

Even though macros could increase the ease with which you can modify existing regexes, the challenge of increasing usability and readability remains unsolved.

Another approach has been taken by Beck et al. [3], who present RegViz, an approach to visually augment regexes without changing their original textual notation. RegViz is a web-based tool that can be found at http://regviz.org/, which at the moment has support for JavaScript regexes. From feedback collected from experts they could conclude that using this tool increased the understanding of regexes. However, even though it is a great tool that could be used to visually debug traditional regexes and to make them easier to understand, it is still a challenge to write and use the regexes because the syntax remains unchanged.

Hollmann and Hanenberg [7] studied the readability of regexes in textual versus graphical representation. Their experiment showed that the time it took for a participant to answer a question about the shortest word was on average three times higher with the textual representation than with the graphical representation. They could also conclude from one of their experiments that the length of a regex is a strong indicator of how readable it is.

Erwig and Gopinath [5] present a new representation of regexes which can be used as an alternative way of explaining them. Not only does this help with the understanding of a regex, but it could potentially help identify faults in them as well. Furthermore, an example of the initial regex is shown, followed by the representation it has after they have run their decompose method on it. The initial regex:

<\s*[aA]\s+[hH][rR][eE][fF]=f\s*>\s* <\s*[iI][mM][gG]\s+[sS][rR][cC] =f\s*>[^<>]*<\s*/[iI][mM][gG]\s*>\s*<\s*/[aA]\s*>

Decomposed regex:

Not only does this representation make the regex easier to read, but it also helps with the understanding of what it does. Their method therefore clearly increases the understanding of a regex. Even though they present a great way of breaking down a regex into a much simpler and more understandable form, they have not yet solved the challenge of reading and using existing ones in practice, nor created an alternative way of developing them.

Michael et al. [12] studied the difficulties a programmer faces when writing a regex, the decisions that have to be made, and the development cycle. The study shows that the participants not only think that regexes are hard to read, but also that they are hard to search for, validate, and document.

According to their study, one common method used to improve comprehension for other developers was to break the regex into several parts on multiple lines, where each individual part had a comment describing its functionality. However, this is, to the best of our knowledge, rarely supported by regex implementations, which is one of the reasons why we support it in our language.

1.3 Purpose

Regexes are used in many different areas of software engineering because they are such a powerful string processing tool. However, they are error-prone and provide little to no feedback when errors occur, since they are often syntactically correct [9]. Because most regexes are syntactically correct, it is not possible to have syntax highlighting, which is considered important for code readability [17]. Syntax highlighting is therefore something we want to support in our language. VSCode and Atom are currently the two text editors for which we provide syntax highlighting.

The idea of creating a new and convenient language came from earlier courses we have taken where regexes were involved. During these courses we observed that a lot of students struggled with getting their regexes to work, mainly due to the lack of readability. However, it is not only students who struggle with regexes. Regexes are quite commonly used in web development for testing and for validating that user input is correct, e.g., for phone numbers or email addresses. If we can provide an easier way that is less error-prone, developers may make fewer mistakes during validation and testing, which may reduce the number of bugs in the software later on, e.g., in production, and could potentially save companies a lot of money.

1.4 Scope

In Regify the focus will be on how a regex syntax more similar to the surrounding programming language syntax can increase the readability, maintainability, and usability when writing regexes.

These three attributes are defined differently by different authors, and could also vary depending on the context. We will give a brief explanation of the meaning behind them in our work.

Readability refers to the time spent by the participant to complete a task, and the number of attempts that have been made before a pattern has been fully matched.

Maintainability refers to the ease with which a developer can understand a snippet of code after a while of not looking at it, and the effort required to change it. I.e., the effort required to modify existing code, add new functionality, or fix bugs that might have been introduced should be low [10].

Usability refers to what extent the language provides information when mistakes are made, and how easily one can learn and use the language.

In the development of Regify we had to focus on a subset of the available functionality of regexes since implementing the functionality for everything was not needed to conduct our experiments and could, therefore, be something for future work. Examples of functionalities that were excluded in Regify:

• API call for named groups.

• Keywords for special sequences such as \d and \w.

• Keyword to match zero to one repetitions of the preceding regex, unless it is a set of characters.

1.5 Brief overview

To get a better understanding of how regexes are used in the industry we conducted exploratory interviews with three professional developers who had between five and ten years of experience writing regexes. The purpose of these interviews is to get a better understanding of common problems they have encountered over the years while writing regexes, in what areas regexes have been used, and what their thoughts are about regexes and the idea of introducing a new syntax for them.

To evaluate an improved way of working with regexes, we designed a syntax that improves the readability of regexes and that can be used directly together with Python's re module. This new syntax is then evaluated through an experiment where the subjects are given a number of tasks to solve with or without the new syntax.

To put Regify to the test, we have constructed an experiment where participants are instructed to solve several tasks with both Regify and traditional regex. In each task, the participants are given a dataset for which they need to come up with a pattern that successfully matches the content of another file. The participants are students who have been programming for at least three years and have little to no experience with regex.

We decided to target the syntax of regexes in Python since it has a few very powerful modules for writing regexes. However, our goal is not to replace the entire existing re module. We only want to replace the syntax, which means our language will be used as a plug-in to these existing modules.

A lexical analyzer (lexer) will be built, which is used to analyze the program text. The program text is broken down into tokens which correspond to symbols in Regify. The next step is to develop a parser which performs the syntactic analysis of the tokens provided by the lexer. The parser in our case will also create an Abstract Syntax Tree (AST) which corresponds to the structure of the program and its internal types that have been created from the tokens. As soon as the AST has been constructed, we perform semantic analysis, which in our case mainly consists of type checking. Finally, when all the previous phases have passed, we translate the AST to the target language, in this case a pattern for the Python re module. For a more detailed description of compiler and programming language construction, we refer the reader to [13].
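The phases described above can be illustrated with a small, self-contained sketch. It is not the Regify implementation: it handles only a hypothetical subset (literal text written as @"..." and a three-argument VARCHAR call), all function and class names are our own assumptions, and a simple range check stands in for the type checking performed in the semantic analysis phase.

```python
import re
from dataclasses import dataclass

# Lexer: break the program text into (kind, value) tokens.
TOKEN_SPEC = [
    ("LITERAL", r'@"[^"]*"'),   # literal text, e.g. @"/bin/env/"
    ("STRING", r'"[^"]*"'),     # string argument, e.g. "a-zA-Z"
    ("NUMBER", r"\d+"),
    ("NAME", r"[A-Z]+"),        # all-caps API call name
    ("PUNCT", r"[(),]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join(f"(?P<{k}>{p})" for k, p in TOKEN_SPEC))

def lex(text):
    tokens, pos = [], 0
    while pos < len(text):
        m = TOKEN_RE.match(text, pos)
        if m is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        if m.lastgroup != "SKIP":
            tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

# AST node types (hypothetical).
@dataclass
class Literal:
    text: str

@dataclass
class VarChar:
    chars: str
    low: int
    high: int

# Parser: syntactic analysis, producing a flat list of AST nodes.
VARCHAR_SHAPE = ["NAME", "PUNCT", "STRING", "PUNCT", "NUMBER", "PUNCT", "NUMBER", "PUNCT"]

def parse(tokens):
    nodes, i = [], 0
    while i < len(tokens):
        kind, value = tokens[i]
        if kind == "LITERAL":
            nodes.append(Literal(value[2:-1]))  # strip @" and "
            i += 1
        elif kind == "NAME" and value == "VARCHAR":
            got = tokens[i:i + 8]
            puncts = [v for k, v in got if k == "PUNCT"]
            if [k for k, _ in got] != VARCHAR_SHAPE or puncts != ["(", ",", ",", ")"]:
                raise SyntaxError('usage: VARCHAR("chars", min, max)')
            nodes.append(VarChar(got[2][1][1:-1], int(got[4][1]), int(got[6][1])))
            i += 8
        else:
            raise SyntaxError(f"unexpected identifier {value!r}")
    return nodes

# Semantic analysis: a simple stand-in for type checking.
def check(nodes):
    for node in nodes:
        if isinstance(node, VarChar) and node.low > node.high:
            raise ValueError("VARCHAR: min must not exceed max")

# Translation: emit a pattern string for Python's re module.
def translate(nodes):
    parts = []
    for node in nodes:
        if isinstance(node, Literal):
            parts.append(re.escape(node.text))
        else:
            parts.append(f"[{node.chars}]{{{node.low},{node.high}}}")
    return "".join(parts)

def compile_source(text):
    nodes = parse(lex(text))
    check(nodes)
    return re.compile(translate(nodes))

pattern = compile_source('@"/bin/env/" VARCHAR("a-zA-Z", 4, 32)')
print(bool(pattern.fullmatch("/bin/env/python")))
```

Keeping the phases as separate functions means that each one can report its own kind of error before any pattern is handed to the re module.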


Chapter 2 Research Questions

Regexes have a steep learning curve due to their unintuitive syntax, and they can therefore quite easily become hard to understand, develop, and maintain. Because of this, we wanted to introduce and evaluate an alternative way of writing them: a regex language more similar to the surrounding programming language. By doing so, we wanted to find out what effect this could have on usability, and what impact it would have on development speed, since it would, e.g., require writing more characters. It would then also be interesting to investigate what other benefits and liabilities this language could introduce. Therefore, the following three questions will be answered:

RQ1: What is the effect on usability when using a regular expression language more like the surrounding programming language?

RQ2: What is the impact on development speed of regular expressions when writing in a regular expression language more like the surrounding programming language?

RQ3: What are the experienced benefits and liabilities of choosing to write regular expressions in a more modernized syntax that is more like the surrounding programming language, rather than the traditional way?


Chapter 3 Regify the Language

3.1 The language

To answer our research questions we needed some tools that were not available, one of which was the actual language to compare the traditional regex syntax against, which became Regify. The goal with Regify was to have a regex language that resembles the syntax of a high-level language (e.g., Python/Java/C++), since these languages are far more readable than the traditional regex syntax. This is because they have visual features for the source code, such as syntax highlighting and formatting/indentation, which the traditional regex syntax does not have. For Regify we created syntax highlighting for two common editors, Atom and VSCode. We decided to go with them because most of the participants in our experiment use them for writing code.

3.1.1 API calls and Keywords

In traditional regex, one may combine many small components to perform a larger task. This sounds great, but these components are usually terse and unreadable for the average user. For example, if the pattern contains a static part "/bin/env/" followed by 4 to 32 upper- or lowercase alphabetic characters, one would have to write something like this

\/bin\/env\/[a-zA-Z]{4,32}

while in Regify one would write

@"/bin/env/" VARCHAR("a-zA-Z", 4, 32)

By doing this we can combine components that are commonly used together in order to bring down the complexity of the overall language. As the example above shows, the API call "VARCHAR" combines the classic character group with the length built in as parameters. Although combining components helps to bring down the complexity, the identifier of the function is just as important for code readability. As Holzmann [8] states, "Naming is important because it affects the readability of your code and the ease with which you can find your way around when reviewing code".
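To make the traditional pattern from this section concrete, the following sketch runs it with Python's re module. The candidate strings are made up for illustration.

```python
import re

# The traditional pattern from the text. The forward slashes are escaped
# as in the original; Python's re module also accepts them unescaped.
pattern = re.compile(r"\/bin\/env\/[a-zA-Z]{4,32}")

# Hypothetical candidates: a valid name, one that is too short,
# and one that contains digits.
candidates = ["/bin/env/python", "/bin/env/sh", "/bin/env/node12"]

# fullmatch only succeeds if the entire string fits the pattern.
results = {c: bool(pattern.fullmatch(c)) for c in candidates}
print(results)
```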

3.1.2 All caps vs. Mixed

While designing the syntax of Regify we had several things in mind to make it as easy as possible to use. At first, it was possible to mix lowercase and uppercase in our keywords, similar to SQL. However, this was rather quickly changed to support only all-caps keywords, which in our view had a large positive impact on the readability of the language. Akour and Falah [1] state that "Although SQL is not a case sensitive, some developers quickly adopted the convention of writing queries in capital case letters to distinguish between keywords and the database objects.". However, we could not find any research showing whether or not this actually increases readability.

3.1.3 Escaping special characters

A character with a special meaning in a regex is called a special character (also known as a metacharacter). For example, the special character dot matches any character except a newline [16]. Therefore, to match a literal dot, it has to be escaped with a preceding backslash, i.e., one has to write '\.'. A full list of the special characters available in a regex can be found in [16].

In Regify, the special characters that normally need to be escaped in a regex are escaped automatically. This lets the user focus on matching the desired pattern instead of worrying about what to escape and when, which seemed to be a real struggle and one of many pitfalls when writing regexes. Suppose that you, for some reason, would like to match the following text with a traditional regex

{(}[/<-\\.//->\])

one might write something like the following, where all of the special characters have been escaped.

\{\(\}\[\/<-\\\\\.\/\/->\\\]\)

However, since escaping special characters is something Regify takes care of, the same pattern could be matched by simply writing

@"{(}[/<-\\.//->\])"

which is the exact same pattern wrapped inside @"", which is the keyword used for matching literal text in Regify.
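For comparison, Python's re module offers a related convenience through re.escape, which adds the backslashes needed to match a string literally. The sketch below is only an analogy to the @"" keyword. How Regify performs the escaping internally is not described here, and the exact set of characters that re.escape escapes depends on the Python version.

```python
import re

# The literal text from the example, as a raw string so that the
# backslashes are kept exactly as written.
literal = r"{(}[/<-\\.//->\])"

# re.escape inserts a backslash before every character that would
# otherwise have a special meaning in a regex.
escaped = re.escape(literal)
print(escaped)

# The escaped pattern matches the original text and nothing else.
matches_itself = re.fullmatch(escaped, literal) is not None
print(matches_itself)
```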

In some cases when writing traditional regexes, one may encounter the combination ".*" which means "match anything until the last occurrence". These combinations are usually not particularly self-explanatory and may confuse the users reading/writing the expression. The way Regify deals with this is to use keywords that provide more readability and ease of use.
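The behavior of ".*" can be seen in a short Python sketch. The input string is hypothetical. The lazy variant ".*?" is included only for contrast and is a feature of the traditional syntax, not a claim about Regify keywords.

```python
import re

text = "key=a; key=b; end"  # hypothetical input

# Greedy: ".*" consumes as much as possible, so the match runs
# up to the last semicolon on the line.
greedy = re.search(r"key=.*;", text).group()

# Lazy: ".*?" consumes as little as possible, so the match stops
# at the first semicolon.
lazy = re.search(r"key=.*?;", text).group()

print(greedy)
print(lazy)
```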

3.1.4 Comments and Formatting

Previous research has shown that commenting your code increases its readability, and if we can successfully increase the readability by supporting this, we will also increase the maintainability. Scalabrino et al. [17] say that "if code is readable, it is pretty easy to start changing it; instead, modifying unreadable code is like assembling a piece of furniture with instructions written in a foreign language the one does not speak: the task is not impossible, but difficult, and a few screws still may remain unused". Therefore, being able to comment your regex in Regify would make it much easier to understand after a while of not looking at it, and also when you need to modify an existing regex. Sedano [18] also states "One of the factors that leads to improved code maintainability is its readability".

Being able to have a consistent indentation in your code also has a huge impact on how easy it is to read, instead of just writing everything on one line like in traditional regexes [17].


Figure 3.1: Example of Regify code versus traditional regex when matching simple cases of IPv6 addresses.

As we can observe in figure 3.1, Regify provides a higher-level syntax compared to traditional regexes. Regify has built-in features for eliminating code repetition, such as the API call "REPEAT", which repeats its content as many times as the user wants. In this example we can see how easy it would be to change the length from "0-4" to "1-4": it requires only two changes, compared to eight in the traditional regex.
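The benefit of removing repetition can be sketched in plain Python as well. The figure itself is not reproduced here, so the code below is not Regify syntax and not the pattern from the figure. It only assumes the full, uncompressed IPv6 form of eight colon-separated groups of hexadecimal digits.

```python
import re

# One group of an IPv6 address: 1 to 4 hexadecimal digits.
# Changing the bounds here changes every repetition at once.
GROUP = r"[0-9a-fA-F]{1,4}"

# Eight groups separated by colons (full, uncompressed form only).
ipv6_simple = re.compile(GROUP + (":" + GROUP) * 7)

is_full = bool(ipv6_simple.fullmatch("2001:0db8:85a3:0000:0000:8a2e:0370:7334"))
is_short = bool(ipv6_simple.fullmatch("2001:db8::1"))  # compressed form, not covered
print(is_full, is_short)
```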

3.1.5 Error handling

As stated earlier, regexes have little to no error handling, since the regexes you write are often syntactically correct. Therefore, we wanted to provide error handling that makes it easy to see whenever an error has occurred in Regify.

Each error gives an exact location in the code by highlighting the row that contains the error. It also gives a detailed description of what type of error it is and, to some extent, shows what has to be done to fix it. However, not all of these features have been implemented yet due to lack of time, and the remainder is left as future work. At the moment, Regify supports three different types of errors. Below we briefly discuss each of them and when it is raised.


3.1.5.1 ValueError

Figure 3.2: ValueError raised due to incorrect number of arguments.

Figure 3.2 shows an example of what this error might look like. At the bottom of the figure, an example text is also displayed to show the user the correct usage of the API call that caused the error.

In Regify a ValueError is mainly used to make sure the API calls have the correct number of arguments.

3.1.5.2 TypeError

Figure 3.3: TypeError raised due to an unexpected argument.

"VARCHAR" expects either a digit or the keyword "MORE" in this case, but in the example shown in figure 3.3 the Kleene star was found, which is neither of the expected types, and therefore the TypeError is raised. In general, this error is raised if an API call is used with the wrong type of argument.
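The two checks described so far can be sketched as a small validation function. This is a hypothetical illustration rather than Regify's actual code: the function name, the assumed signature VARCHAR(chars, min[, max]), and the error messages are our own assumptions.

```python
def check_varchar_args(args):
    """Validate arguments for a hypothetical VARCHAR(chars, min[, max]) call."""
    # ValueError: wrong number of arguments.
    if not 2 <= len(args) <= 3:
        raise ValueError('VARCHAR expects 2 or 3 arguments, e.g. VARCHAR("a-z", 1, 8)')
    chars, *bounds = args
    # TypeError: an argument of an unexpected type.
    if not isinstance(chars, str):
        raise TypeError("VARCHAR: first argument must be a character set string")
    for bound in bounds:
        # Each bound must be a non-negative integer or the keyword MORE.
        if not (bound == "MORE" or (isinstance(bound, int) and bound >= 0)):
            raise TypeError(f"VARCHAR: expected a digit or MORE, found {bound!r}")
    return True

print(check_varchar_args(["a-zA-Z", 4, 32]))
```

Including an example of correct usage in the message mirrors the example text shown at the bottom of the error output in the figures.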


3.1.5.3 SyntaxError

Figure 3.4: Example of the SyntaxError message.

An error caused by not following the proper syntax of a language is defined as a SyntaxError. In this example, we can see from the example usage that the API call "VARCHAR" expects at least two arguments. However, only one is given in the code, and therefore the SyntaxError is raised. In Regify this error is also raised when an unexpected identifier is found. An "unexpected identifier" is a keyword or character that does not exist in the language, e.g., if "RANGE" was used instead of "VARCHAR", this error would be raised as well.


Chapter 4 Research Method

4.1 Interviews

We decided to interview developers who work with regexes regularly in their workflow. What we expect to achieve with these interviews is a better understanding of how regexes are used in practice. The developers we interviewed had at least five years of experience with regexes and use them on a regular basis in their workflow to solve certain tasks. The questions we asked were meant to focus on readability, maintainability, and usability, and also on what common problems they encounter while using regexes.

These are some key points we wanted to have answered during the interviews:

1. Experience

How much experience in regex do they have and what do they use it for on a regular basis? This would give us clarity of a few applications regexes have in real life.

2. Environment

What environment are they using for writing these regexes, e.g., PHP, Python, or Bash?

3. Typical use cases

With this question we want to find out what typical use cases regexes can have and what impact they have on the application.

4. Readability

When are regexes considered to be too extensive, regarding the number of characters?

5. Maintainability

How often does a regex have to be changed? If they need to be modified, is it better to rewrite them or is it easy to add more features?

6. Common problems

What are some common problems they encounter while using regexes? When there is a syntax error, is there any valuable information missing?


4.2 Experiment procedure

Our focus in the experiment was to find out how well Regify performs compared to traditional regexes in three categories: readability, maintainability, and usability. The experiment was designed to target people with at least three years of experience in high-level programming languages and little to no prior experience in regex. This gives a fair comparison, because the knowledge level is about the same for both languages, and these people were also the target group we imagined would use Regify. During the experiment, the participants always started by solving the sub-task in Regify.

4.2.1 Subjects

The participants in this experiment had three years of experience in high-level programming languages like C++/Java/Python and little to no prior experience in regexes. The participants were students in Software Engineering or Computer Science programs, all in their third year. Before the experiment began, we allowed the students to read the API documentation we provided, but not to practice writing Regify. A student who had participated in the experiment was not allowed to share any of their solutions with other students until the entire experiment was done.

4.2.2 Environment

In order to answer our research questions, we needed an environment that the participants could download and run on their own machines, with all dependencies set up. This would also give them the same starting point, tools, and syntax highlighting, which is important for the experiment to be valid. The environment contains a directory with an "experiment manager" script (footnote 1) which the participants used to simplify the process of running the tasks. Instead of having to change directory back and forth for each task, the participants executed this script from the top directory and provided arguments to specify the language and task they wanted to execute. This also gave us the possibility to log and save all necessary data generated by the participants. The directory also contained all the "task" directories, which held the files in which the participants should write their code, and followed a simple structure for ease of use.

4.2.3 Instrumentation

During the experiment, the participants were allowed to search the web for information that could help them, since this best simulates a real-life use case. Because there was no information about Regify available, we wrote a simple API documentation (footnote 2) where the participants could find the correct syntax and parameters for all the API calls and keywords Regify supports. During the experiment we supervised each participant through a video call the whole time, which allowed us to give them immediate feedback if something was unclear or if there were any errors in the environment while running the different tasks.

Footnote 1: The source code and experiment manager script are both available at https://github.com/Aadamandersson/regify

Footnote 2: The API documentation is available at https://regify.github.io/


4.2.4 Data collection and metrics

For the experiment we decided to measure:

1. How many attempts it took for the participant to complete a given task in both of the languages.

• Counting the attempts gives us statistical data describing how many runs are required before having a working regex, which we can compare between the two languages. In our case we find this extremely interesting, since the participants solved the tasks in both languages but started out with Regify, which could give them an advantage when changing over to traditional regex.

2. The number of characters modified on each attempt.

• By measuring the number of characters that are changed in a pattern we get a better understanding of how readable and understandable the languages are. If the number of modified characters is high, this could indicate poor readability of the language, where the user modifies different things just to see what happens.

3. Average pattern length.

• Even though we know that Regify in general requires more characters in total to create the desired pattern, we wanted to see how big the difference really was and whether there is a breaking point at which Regify requires fewer characters than the same pattern in a traditional regex.

4. The total time spent on each task.

• With Regify we know that you will in most cases write more characters than you would with the traditional way of writing regexes, so by measuring time, we can determine whether the gain in readability and usability keeps the development speed the same, or even makes it faster. Time is also very interesting to measure when it comes to maintainability, e.g., how much time the user needs to understand what the regex means and which parts have to be modified.
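As an illustration of how metrics 2 and 3 could be computed, the sketch below uses Python's difflib to count the characters that differ between two consecutive attempts and computes an average pattern length. The text does not state how the experiment program computed these values, so the approach and the function names are assumptions made for illustration.

```python
from difflib import SequenceMatcher

def characters_modified(previous, current):
    """Count characters inserted, deleted, or replaced between two attempts."""
    changed = 0
    matcher = SequenceMatcher(None, previous, current, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Count the larger side so a replacement is not counted twice.
            changed += max(i2 - i1, j2 - j1)
    return changed

def average_length(patterns):
    """Average number of characters over a list of patterns."""
    return sum(len(p) for p in patterns) / len(patterns)

# Hypothetical attempts at the same task.
attempts = ["[a-z]{4}", "[a-zA-Z]{4}"]
print(characters_modified(attempts[0], attempts[1]))
print(average_length(attempts))
```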

4.2.5 Tasks

The experiment consisted of nine different tasks designed to provide measurement data about the usability and maintainability of our syntax compared to the traditional regex syntax. Each task had to be solved in both Regify and traditional regex. For each task, we provided a description that contained examples of input and output data for a successfully matched pattern, and information about what to search for if the participants needed help to complete the task.

The tasks were designed to take about 10 minutes to complete, and because each task had to be written in both Regify and regex, the participants had approximately five minutes per language per task. This gave us an estimate that the experiment would take roughly 90 minutes to complete, which we thought was sufficient given the complexity of the tasks.

The first six tasks were focused on usability, meaning there was no code provided for the task. In these tasks the participants had to look up and find out what pieces were needed to complete the task. The three remaining tasks were partially completed in both Regify and traditional regex, but they had to be modified in order to successfully complete the tasks.


4.2.5.1 Example tasks

A full list of the tasks is available in Appendix A, which also contains the starting point code for tasks seven to nine.

An example of a maintainability task is a regex that only matches email addresses using the "com" and "net" root domains. To successfully complete this task, the participant had to extend the regex to include the "se" root domain as well.

An example of a usability task is extracting all the lines containing an error from a given input file. The following is an example of the data provided to the participant:

/dev/rdisk1s2: fsck_apfs started at Sun Jan 19 19:42:38 2020
/dev/rdisk1s2: ** QUICKCHECK ONLY; FILESYSTEM CLEAN
/dev/rdisk1s2: fsck_apfs completed at Sun Jan 19 19:42:38 2020
/dev/rdisk1s4: error: container /dev/rdisk1 is mounted with write access; please re-run with -l.

The following is an example of the expected output with the previously shown example data as input:

error: container /dev/rdisk1 is mounted with write access; please re-run with -l.
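For illustration, the usability task above could be solved with a traditional regex roughly as follows. This is a minimal sketch using Python's built-in re module; the variable names and the exact pattern are our assumptions and not necessarily what the participants wrote.

```python
import re

# Example input data from the task description above.
log = """/dev/rdisk1s2: fsck_apfs started at Sun Jan 19 19:42:38 2020
/dev/rdisk1s2: ** QUICKCHECK ONLY; FILESYSTEM CLEAN
/dev/rdisk1s2: fsck_apfs completed at Sun Jan 19 19:42:38 2020
/dev/rdisk1s4: error: container /dev/rdisk1 is mounted with write access; please re-run with -l."""

# Match from the literal "error:" up to the end of the line.
# "." does not match a newline by default, so ".*" stops at the line end.
error_pattern = re.compile(r"error:.*")

errors = error_pattern.findall(log)
for line in errors:
    print(line)
```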


Chapter 5 Results and Analysis

5.1 Results

5.1.1 Interviews

Our interviews were focused on getting a better understanding of how regexes are used in real-life scenarios, ranging from file searching to user input validation. We interviewed three developers in total, each with between five and ten years of experience with writing and maintaining regexes that are used in software available for commercial use. The questions asked were focused around our three main categories: readability, maintainability, and usability.

From these interviews we noticed right away that the developers shared the opinion that the main problem with traditional regexes is the steep learning curve caused by the syntax. Because of this, combined with a lack of motivation, many developers do not invest time in learning regexes. The developers we interviewed agreed that regexes are an extremely powerful text processing tool. In their experience, almost all web developers encounter regexes in one way or another, since they are commonly used for validating user input in text forms and when rewriting code, e.g. to replace static IP addresses or paths when switching environments. As one of the developers said, "if regular expressions were not as powerful as they are, then they would not have existed, or at least in a different form", which the others also agreed with.

What the developers seemed to like about the traditional way of writing regexes is that they tend to be slim, although once they are longer than 25-30 characters they start to become unreadable. Even though some of these developers write regexes on a regular basis, they pointed out that it does not take long before you start forgetting the syntax, even for experts. The developers also pointed out that one of the most challenging parts of working with and learning regexes is knowing which characters have to be escaped. Even for experts it is sometimes hard to tell which characters need escaping, especially if there are a lot of them in a row. Sometimes the different regex libraries also have dialectal differences that can cause confusion.

Usually when these developers use already existing regexes, e.g. to match email addresses, it tends to be quite difficult to modify the regex, especially if they did not write it themselves. Instead of changing the existing regex, they usually try to leave as much of it as possible unchanged and complement it with their own part in order to match the specific pattern. Changing a regex is often a quite exhausting task that can easily introduce errors, and therefore the developers tend to leave it as is, assuming it works as intended.

According to these developers, writing wrappers for programming languages is a quite popular approach to increase readability and usability for rough languages, such as the work we have done with the syntax for regexes. Another example of this could be TypeScript, which is a wrapper for JavaScript with the purpose of increasing readability and maintainability. Two of the developers pointed out that there are sometimes minor dialectal differences between regexes when switching from one environment to another, e.g. from using regexes in PHP to using them in Perl.

During the interviews, the developers concluded that the following attributes would be preferred in a regex language such as Regify:

1. Auto-escaping characters

Since escaping characters is such a difficult task, we think it is important to handle this automatically in the language. This will probably also help reduce errors in the code.

2. Syntax highlighting

This is an easy way to help improve the readability of the language, and we think it is almost a "must-have" feature for a modern language.

3. Comments

Commenting your code can easily help you understand what a certain piece of code does, not least for the traditional regex syntax.

4. Error handling

As stated earlier, the traditional regex syntax has little to no error handling when a syntactical mistake is made, which was something we definitely wanted to support in our language.
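For comparison, Python's built-in re module offers partial counterparts to three of these attributes. The sketch below is not Regify code; the patterns, sample strings, and variable names are illustrative assumptions.

```python
import re

# 1. Auto-escaping: re.escape() escapes the special characters in a literal,
#    so the literal can be matched without escaping it by hand.
literal = "[9228::]"
escaped = re.escape(literal)
literal_match = re.fullmatch(escaped, literal)

# 3. Comments: with re.VERBOSE, whitespace in the pattern is ignored and
#    "#" starts a comment, so a pattern can be split over several lines.
process_pattern = re.compile(
    r"""
    [A-Za-z]+   # process name: one or more letters
    \[ \d+ \]   # process id: digits inside brackets
    """,
    re.VERBOSE,
)
process_match = process_pattern.search("Jerrys-MBP syslogd[88]: ASL Sender Statistics")

# 4. Error handling: an invalid pattern raises re.error, which carries
#    a message describing the problem.
try:
    re.compile(r"[a-z")  # unterminated character set
    compile_error = None
except re.error as exc:
    compile_error = exc
```

Syntax highlighting (attribute 2) depends on the editor rather than on the regex library, so it has no counterpart here.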

5.1.2 Experiment

The data collected during our experiment was generated by a total of ten students with three years of background in software development using high-level languages. Each student had a maximum of two hours to complete the experiment, which appeared to be sufficient to complete all nine tasks, since the participants took about one and a half hours on average.

To measure the average time for each task, we used the screen recordings to mark the time from when a participant began writing until they achieved a 100% match. Our initial thought was to use the timestamps we saved for each run to determine the total time spent on the tasks, but this turned out to be very inaccurate because the time spent reading the descriptions between tasks would have been included, and if a participant went to another task and came back later the time would not be accurate at all.

During the experiment, when the participants made an error in the code, we observed the actions and steps they took in order to solve the problem. Since Regify provides an error message containing information about what is wrong, shows what line the error occurs at, and gives example usage of the API call (see figure 3.4), the participants directly returned to the code and resolved the error. In contrast, when there was an error in the traditional regex code, the participants had a hard time understanding what should be changed. This resulted in participants changing characters without really knowing the effect, and they struggled to understand what should be modified. This behavior was especially clear when special characters were involved, which have to be escaped in order to be matched.


5.1.3 Experimental Results

In the following figures we present the results with the data collected from the experiment. We would once again like to point out that, in all of the tasks, the participant started out in Regify, which gave them the advantage of solving the task in this language before moving over to traditional regex. All of the results shown are based on the participant solving the entire task before moving on to the next one.

Figure 5.1: Average number of attempts until a 100% match.

Average attempts were calculated on the total runs for each task over the entire experiment. All attempts up to and including the 100% match were counted; additional runs were not.
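The counting rule can be sketched as follows. The data layout (one list of booleans per participant, one entry per run, True for a 100% match) is a hypothetical assumption for illustration; it is not the log format of the experiment environment.

```python
from statistics import mean


def attempts_until_match(runs):
    """Count runs up to and including the first 100% match.

    `runs` is a list of booleans, one per run, where True marks a 100% match.
    Runs after the first match are ignored. Returns None if no run matched.
    """
    for index, matched in enumerate(runs, start=1):
        if matched:
            return index
    return None


# Hypothetical runs of three participants on one task.
task_runs = [
    [False, False, True],  # match on the third run
    [True, False, True],   # match on the first run; later runs are not counted
    [False, True],         # match on the second run
]

counts = [attempts_until_match(runs) for runs in task_runs]
average_attempts = mean(counts)
```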


Figure 5.2: Average pattern length for each task.

The pattern length was calculated by subtracting whitespaces, newlines and tabs, since some editors used soft indentation (i.e. four whitespaces).
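A minimal sketch of this length calculation is shown below. The function name and both sample patterns are illustrative assumptions; the Regify-like snippet is only loosely modeled on the code in Appendix A. Note that this simple version also removes whitespace that appears inside string literals.

```python
import re


def pattern_length(pattern):
    """Length of a pattern after removing spaces, tabs, and newlines."""
    return len(re.sub(r"[ \t\r\n]", "", pattern))


# Hypothetical patterns for the same task, written with soft indentation.
regify_like = 'GROUP(\n    @"error:",\n    UNTIL\n)'
traditional = r"error:.*"

regify_length = pattern_length(regify_like)
traditional_length = pattern_length(traditional)
```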

Figure 5.3: Average number of characters modified until a 100% match.

Average characters modified per pattern were calculated on each attempt that failed to match the desired pattern. This did not include the participant's first attempt since it is not considered a "modification".


Figure 5.4: Average time spent per task.

The total time per task was timed from when the participant began writing after reading the instructions of the task and continued until a 100% match was achieved.

5.2 Analysis

As seen in figure 5.1, the participants tend to need far fewer attempts to get a perfect match when using Regify compared to traditional regexes. Since the participants had never seen Regify and had at most briefly touched regexes before, this could indicate that the learning curve is significantly lower when writing in a regex language that is more similar to the surrounding programming language. The average time spent on each task (see figure 5.4) tends to be lower in Regify as well, but there are two tasks (seven and eight) where the average time in traditional regex was lower or about the same. Both tasks were in the maintainability category, where the participant had to update existing code in order to match the expected output. We think the reason is that the participants started out in Regify: solving the task there first gave them a clear picture of what they had to update in the traditional regex in order to match the pattern.

The results shown in figures 5.1 and 5.4 were both used to measure the readability of Regify. High readability, in this case, means as few attempts as possible combined with as little time spent as possible on each task. By looking at tasks one to three, we can see that Regify has both fewer attempts and less time per task compared to regex. These three tasks require the users to escape several characters, which is considered challenging when writing traditional regexes because it is hard to understand and not particularly readable.


After analyzing the screen recordings we observed that some of the participants looked back at the code they wrote in Regify in order to get a better understanding of how the pattern works, and translate that to traditional regex code. The reason we think they did this was because of the readability Regify provides, which they could use to break down the pattern to make the translation easier. However, the number of attempts was still higher because it was harder to fully understand where to place the characters and what characters to escape since this is internally done by Regify.

One of the reasons we think the attempts went up, even though the participants had a reference for what to write, was the dialectal differences between regex libraries. When the participants googled their problems, they could find what they needed, but the regex would not always work because it was sometimes written for a regex library other than Python's. These regex libraries share some similarities but also have a lot of differences; for instance, the special characters can differ depending on which regex library you use, or the libraries might have different built-in features or shortcuts for matching words or digits.

Regify is a more verbose approach to expressing text patterns, resulting in more characters to perform the same functionality as a traditional regex. Therefore, we expected that the number of modified characters would also be higher for Regify. The results show that, while Regify does indeed require more character changes, it does not require dramatically more. On average, about 15% more characters are required in Regify compared to traditional regexes. Different dialects and the need to escape special characters were the main reasons for the increase in modified characters for traditional regexes (see figure 5.3).

In figure 5.2, we can see what the average pattern length was once the participant got a 100% match. Even though we knew that Regify in most cases probably would require a longer pattern due to its syntax, we still thought it would be interesting to see whether or not this could change depending on the complexity of the regex.

If we look at the average pattern length of the last three tasks, we can see that the difference between the two languages is small, only a few characters.

By looking at figures 5.2 and 5.4 we can see that as the traditional regex becomes longer, the average time spent on the task tends to increase, whereas for Regify the length does not seem to affect the overall time spent on each task.


Chapter 6 Validity Threats

In the experiment we conducted, we had a total of ten students participating. More participants would have been desirable, but to compensate for this we spent more time on the experimental environment to collect as much accurate data as possible. Because of the small number of participants, it is hard to tell how accurate the results are, which is a threat to the validity of the results.

Using students as subjects may have yielded different results compared to using professionals, which is something we are aware of and took into consideration when designing the tasks for the experiment. There is always a risk in using students as subjects; however, this threat was reduced by using third-year students who are at the end of their education [21].

During the experiment we noted that the participants in some tasks did not read the instructions properly and therefore made careless mistakes, which affected the number of attempts it took for a given task to be solved. However, this mainly affected the number of attempts it took to solve the tasks in Regify, since the participants started in this language.


Chapter 7 Conclusion

In this research, we have conducted exploratory interviews to get a better understanding of how regexes are used by developers in their workflow, and what problems they frequently encounter. This gave us a better picture of what areas could be improved for increasing readability, maintainability, and usability, such as syntax highlighting, error handling, and removing the need for escaping special characters.

To compare and test the new language, we also constructed an experiment with an environment that could collect important measurement data such as time spent per task, average attempts, etc. The experiment focused on evaluating readability, maintainability, and usability compared to the traditional regex syntax. We had a total of nine tasks, where the first six were designed to test usability and the last three tested maintainability. In our experiment we could see that the users felt more comfortable writing their regex code in Regify because the syntax is similar to what they are used to.

RQ1: What is the effect on usability when using a regular expression language more like the surrounding programming language?

Our results indicate that a regex language more similar to the surrounding programming language has a positive effect on usability. During the experiments the participants could easily solve any errors that occurred, and the learning curve seems to be significantly lower as well.

RQ2: What is the impact on development speed of regular expressions when writing in a regular expression language more like the surrounding programming language?

Even though you need to write more characters in Regify, which could have had an effect on the development speed, our results indicate that the development speed in Regify is on average 80% faster compared to when writing a regex the traditional way.

RQ3: What are the experienced benefits and liabilities of choosing to write regular expressions in a more modernized syntax that is more like the surrounding programming language, rather than the traditional way?

After analyzing our results, we can conclude that using a more modern syntax for regexes reduces errors and mistakes by the developer and provides a lot more readability compared to traditional regex syntax. We have also experienced that the learning curve is much lower, and it has a positive impact on development speed. However, if you would write your regex with Regify directly in Python, you would lose the syntax highlighting, which could be a liability in terms of how readable it would be.

Our final conclusion after executing the experiment is that introducing a regex language that is more like the surrounding programming language increases the usability, maintainability, and readability of regexes, which was what we expected. Even though one might argue it is unnecessary to write in a language like this for simple expressions, we think that writing inline regexes in Regify would still be a better choice because of the possibility to comment and divide your regex into multiple lines.


Chapter 8 Future Work

Many features have been left out of Regify, partly due to lack of time because the experiments required a lot more time than anticipated, but also because they were out of scope for this research. Even though this thesis was mainly focused on whether a new syntax for regexes could increase readability, usability, and maintainability, it has not covered all the functionality needed for building, e.g., more complex expressions. We also decided not to implement any new features in Regify that a traditional regex would not support, because of the structure of our experiment. With that in mind, we have a few interesting ideas that might be worth looking further into:

1. Variable support.

• It would be interesting to see if introducing variables in Regify would increase the usability even more. By doing so, it would be possible to divide your expressions more efficiently and later on reuse the ones that you have already defined. This could also increase the readability a lot if we assume the developer has good variable naming conventions. For instance, let us assume a valid email address contains alphanumeric characters, followed by ’@’, followed by alphanumeric characters, and lastly followed by ’.com’. Let us also assume that the length of the alphanumeric characters has to range from 1 to 64. The following is a simple example of how it might look:

alphaNum = VARCHAR("a-zA-Z0-9_", 1, 64)
domainName = @".com"
emailAddress = alphaNum + @"@" + alphaNum + domainName

Even though this is a trivial example, updating your pattern to also match ’.net’ domains would be extremely easy, and if you want to support more characters than just alphanumeric ones this could easily be changed too.

2. Translate a traditional regex into Regify.

• Something that could be interesting to see by enabling this translation between the languages is if this possibility would make it easier for the developer to debug the regex since it would be in a more readable representation. By doing so, Regify might also find bugs and errors in the regex which potentially could save a lot of time.
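The variable idea from item 1 can already be approximated in plain Python by composing pattern strings. The sketch below is not Regify; it mirrors the email example with names we chose for illustration, using only Python's built-in re module.

```python
import re

# Reusable building blocks, mirroring the Regify variables in item 1.
alpha_num = r"[a-zA-Z0-9_]{1,64}"
domain_name = re.escape(".com")  # escape the dot so it only matches a literal "."

email_address = alpha_num + "@" + alpha_num + domain_name
email_pattern = re.compile(email_address)

# Supporting '.net' as well only requires changing one building block.
domain_name_extended = r"(?:\.com|\.net)"
email_pattern_extended = re.compile(alpha_num + "@" + alpha_num + domain_name_extended)
```

Here `email_pattern.fullmatch("bob@icloud.com")` succeeds, while an address ending in '.net' is only accepted by `email_pattern_extended`.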

In Regify we currently only support writing in capital letters (see section 3.1.2). However, it would be interesting to see whether or not this has an effect on readability and ease of use, which is something that has been left out for the future.


References

[1] M. Akour and B. Falah. “Application domain and programming language readability yardsticks”. In: 2016 7th International Conference on Computer Science and Information Technology (CSIT). 2016, pp. 1–6.

[2] Pablo Barceló, Leonid Libkin, and Juan L. Reutter. “Parameterized Regular Expressions and their Languages”. In: CoRR abs/1107.0577 (2011). arXiv: 1107.0577. URL: http://arxiv.org/abs/1107.0577.

[3] Fabian Beck, Stefan Gulan, Benjamin Biegel, Sebastian Baltes, and Daniel Weiskopf. “RegViz: Visual Debugging of Regular Expressions”. In: Companion Proceedings of the 36th International Conference on Software Engineering. ICSE Companion 2014. Hyderabad, India: Association for Computing Machinery, 2014, pp. 504–507. ISBN: 9781450327688. DOI: 10.1145/2591062.2591111.

[4] Angel X. Chang and Christopher D. Manning. TOKENSREGEX: Defining cascaded regular expressions over tokens.

[5] Martin Erwig and Rahul Gopinath. “Explanations for Regular Expressions”. In: Fundamental Approaches to Software Engineering. Ed. by Juan de Lara and Andrea Zisman. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012, pp. 394–408. ISBN: 978-3-642-28872-2.

[6] Diwaker Gupta. “What is a Good First Programming Language?” In: XRDS 10.4 (Aug. 2004), p. 7. ISSN: 1528-4972. DOI: 10.1145/1027313.1027320.

[7] N. Hollmann and S. Hanenberg. “An Empirical Study on the Readability of Regular Expressions: Textual Versus Graphical”. In: 2017 IEEE Working Conference on Software Visualization (VISSOFT). 2017, pp. 74–84.

[8] G. J. Holzmann. “Code Clarity”. In: IEEE Software 33.2 (2016), pp. 22–25.

[9] E. Larson and A. Kirk. “Generating Evil Test Strings for Regular Expressions”. In: 2016 IEEE International Conference on Software Testing, Verification and Validation (ICST). 2016, pp. 309–319.

[10] Young Lee and Kai H. Chang. “Reusability and Maintainability Metrics for Object-Oriented Software”. In: Proceedings of the 38th Annual on Southeast Regional Conference. ACM-SE 38. Clemson, South Carolina: Association for Computing Machinery, 2000, pp. 88–94. ISBN: 1581132506. DOI: 10.1145/1127716.1127737.

[11] Yunyao Li, Rajasekar Krishnamurthy, Sriram Raghavan, Shivakumar Vaithyanathan, and H. V. Jagadish. “Regular Expression Learning for Information Extraction”. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing. EMNLP ’08. Honolulu, Hawaii: Association for Computational Linguistics, 2008, pp. 21–30.

[12] L. G. Michael, J. Donohue, J. C. Davis, D. Lee, and F. Servant. “Regexes are Hard: Decision-Making, Difficulties, and Risks in Programming Regular Expressions”. In: 2019 34th IEEE/ACM International Conference on Automated Software Engineering (ASE). 2019, pp. 415–426.

[13] Torben Ægidius Mogensen. Introduction to compiler design. Springer, 2017.

[14] Zsolt Nagy. Regex Quick Syntax Reference: Understanding and Using Regular Expressions. Apress, 2018.

[15] Gonzalo Navarro and Mathieu Raffinot. “New techniques for regular expression searching”. In: Algorithmica 41.2 (2005), pp. 89–116.

[16] V. Romero and F. López. Mastering Python Regular Expressions. leverage regular expressions in Python even for the most complex features. Birmingham: Packt Publ., 2014.

[17] Simone Scalabrino, Mario Linares-Vásquez, Rocco Oliveto, and Denys Poshyvanyk. “A comprehensive model for code readability”. In: Journal of Software: Evolution and Process 30.6 (2018), e1958.

[18] T. Sedano. “Code Readability Testing, an Empirical Study”. In: 2016 IEEE 29th International Conference on Software Engineering Education and Training (CSEET). 2016, pp. 111–117.

[19] Andreas Stefik and Susanna Siebert. “An Empirical Investigation into Programming Language Syntax”. In: ACM Trans. Comput. Educ. 13.4 (Nov. 2013). DOI: 10.1145/2534973.

[20] Mikael Svahnberg, Aybüke Aurum, and Claes Wohlin. “Using Students as Subjects - an Empirical Evaluation”. In: Proceedings of the Second ACM-IEEE International Symposium on Empirical Software Engineering and Measurement. ESEM ’08. Kaiserslautern, Germany: Association for Computing Machinery, 2008, pp. 288–290. ISBN: 9781595939715. DOI: 10.1145/1414004.1414055.

[21] Claes Wohlin, Per Runeson, Martin Höst, Magnus C Ohlsson, Björn Regnell, and Anders Wesslén. Experimentation in software engineering. Springer Science & Business Media, 2012.


Appendix A

Experiment tasks

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ TASK 1 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For this task you will have to match a static pattern which is surrounded by ’noise’. Make sure the noise is not included in the matched pattern.

~~~~ API documentation suggestions ~~~~

"Literal text"

~~~~ Bugs & Workarounds ~~~~

No bugs found or workarounds needed in this task!

~~~~ Example of matched patterns ~~~~

[/<-\\.//->\] [/<-\\.//->\] [/<-\\.//->\] [/<-\\.//->\] [/<-\\.//->\]

~~~~ Example of input data ~~~~

[/<-\\.//->\]VOIDNovWHITE000101$$$"NULL $OctWHITEphi JunSunBLACKMay$$[/<-\\.//->\]Sun010011NULL 3.14153blahBLACKGreenBLACK111111Seppichipi3.1415 [/<-\\.//->\]NovFri[/<-\\.//->\] fooNULLNULL’2.8RedAugfubarBLACKpiBLACK2.8 Greenfoo$NILLfooWHITEYellowblah100111 WHITEMay3.1415[/<-\\.//->\]pipi chi[/<-\\.//->\]pi$$WHITE$$1110013.14153TueBLACK[/<-\\.//->\]taoRedBlue’ phiBLACKSunFeb’101100’


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ TASK 2 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For this task you will have to match a dynamic pattern which is surrounded by ’noise’. The pattern begins with a bracket, followed by four digits, and ends with two colons and a closing bracket. Make sure the noise is not included in the matched pattern.

"Literal text" "VARCHAR"

[9228::] [6884::] [4909::] [0718::]

phi010110Satfubar[9228::]Thu2.82.8Greentao JulNILL$$$001010Sep‘[6884::]OrangeMonRed$BLACKchi110110VOIDNULL 001110fubar 3.14"foo$$Red[4909::]taoNILLAprOcttaofubar[0718::]Jun SatRedfubar$$010001Purple111010[8788::][3885::]NULL[1455::]WHITE" Sun110010$$WHITE111001SatWHITESep[6655::]BLACKphichiHITE$$1110013 .14153TueBLACK[/<-\\.//->\]taoRedBlue’phiBLACKSunFeb’101100’
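For reference, one possible traditional regex for this task is sketched below using Python's built-in re module. The pattern and the shortened sample string (excerpts of the input data above) are our illustration, not a solution taken from the experiment.

```python
import re

# An opening bracket, four digits, two colons, and a closing bracket.
# Brackets are special characters in a regex and must be escaped.
pattern = re.compile(r"\[\d{4}::\]")

# Excerpts of the input data, joined into one string.
sample = "phi010110Satfubar[9228::]Thu2.82.8Greentao$$Red[4909::]taoNILL"

matches = pattern.findall(sample)
```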


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ TASK 3 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This task is built upon a syslog dump that Jerry has provided from his MacBook Pro, and now he needs help to find all processes and services that exist in the logfile.

The pattern for a process in this logfile is built from 1 to N upper- or lowercase alpha (a-z, A-Z) characters, followed by brackets containing a number (1 to N digits).

syslogd[88] syslogd[88] launchd[1] objc[16675] remindd[16675] launchd[1] AGMService[387]

Apr 12 09:12:59 Jerrys-MBP syslogd[88]: ASL Sender Statistics
Apr 12 09:13:02 Jerrys-MacBook-Pro com.apple.xpc.launchd[1]
Apr 12 09:13:02 Jerrys-MacBook-Pro com.apple.xpc.launchd[1]
Apr 12 09:13:03 Jerrys-MacBook-Pro com.apple.xpc.objc[16675]
Apr 12 09:13:03 Jerrys-MacBook-Pro com.apple.xpc.objc[16675]
Apr 12 09:13:07 Jerrys-MBP timed[126]: settimeofday({0x5e92bf83,0xbd5a1}) == 0
Apr 12 09:13:07 Jerrys-MBP timed[126]: settimeofday({0x5e92bf83,0xbd5a1}) == 0
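As an illustration of the process pattern described in this task, the following sketch applies one possible traditional regex to a few of the log lines above with Python's re module. The pattern and variable names are our assumptions, not a recorded participant solution.

```python
import re

# One or more letters, followed by one or more digits inside escaped brackets.
process_pattern = re.compile(r"[A-Za-z]+\[\d+\]")

lines = [
    "Apr 12 09:12:59 Jerrys-MBP syslogd[88]: ASL Sender Statistics",
    "Apr 12 09:13:02 Jerrys-MacBook-Pro com.apple.xpc.launchd[1]",
    "Apr 12 09:13:07 Jerrys-MBP timed[126]: settimeofday({0x5e92bf83,0xbd5a1}) == 0",
]

processes = [match for line in lines for match in process_pattern.findall(line)]
```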


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ TASK 4 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This task is built upon a syslog dump that Jerry has provided from his MacBook Pro. The objective for this task is to find all hexadecimal numbers in the file.

A hexadecimal number always begins with ’0x’ and is followed by one or more characters valid for a hexadecimal number (0 to 9, a-f).

0x5e92bf82 0x57eb1 0x7fce79c1f140 0x7fce79c1f140 0x7fff93b87508 0x10abe91f8

Apr 12 11:28:06 Jerrys-MBP timed[126]: settimeofday({0x5e92df26,0x56469})
Apr 12 11:28:06 Jerrys-MBP timed[126]: settimeofday({0x5e92df26,0x56469})
Apr 12 12:28:11 Jerrys-MBP syslogd[88]: ASL Sender Statistics
Apr 12 12:28:11 Jerrys-MBP syslogd[88]: ASL Sender Statistics
Apr 12 12:28:12 Jerrys-MBP timed[126]: settimeofday({0x5e92ed3c,0x400c0})
Apr 12 12:28:12 Jerrys-MBP timed[126]: settimeofday({0x5e92ed3c,0x400c0})
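The hexadecimal rule stated in this task can be sketched as a traditional regex in Python as follows. The pattern is one possible reading of the task description and the variable names are ours.

```python
import re

# The literal prefix "0x" followed by one or more of the characters 0-9 and a-f.
hex_pattern = re.compile(r"0x[0-9a-f]+")

line = "Apr 12 11:28:06 Jerrys-MBP timed[126]: settimeofday({0x5e92df26,0x56469})"
hex_numbers = hex_pattern.findall(line)
```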


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ TASK 5 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The objective of this task is to find all errors and the message that corresponds to each of them. The error messages always start with ’error:’ and continue until the line ends.

"Literal text" "END"

"UNTIL"

error: container /dev/rdisk2 is mounted with write access;
error: container /dev/rdisk3 is mounted with write access;
error: container /dev/rdisk1 is mounted with write access;

/dev/rdisk1s2: fsck_apfs started at Sun Jan 19 19:42:38 2020
/dev/rdisk1s2: ** QUICKCHECK ONLY; FILESYSTEM CLEAN
/dev/rdisk1s4: error: container /dev/rdisk1 is mounted with write access;
/dev/rdisk1s4: fsck_apfs completed at Sun Jan 19 19:42:38 2020


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ TASK 6 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For this task you will have to find all usernames in this /etc/passwd file. Each user has its own line, and the name is always in the beginning.

Valid characters for these usernames are both uppercase and lowercase alphanumeric characters, and underscore (’_’). The name must have at least one character and at most 64 characters.

"START" "VARCHAR"

nobody root

_taskgated

_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false
_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false
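A possible traditional regex for the username rule is sketched below in Python. It relies on the re.MULTILINE flag so that "^" matches at the start of every line; the pattern and names are our illustration rather than a solution from the experiment.

```python
import re

# At the start of each line: 1 to 64 letters, digits, or underscores.
username_pattern = re.compile(r"^[A-Za-z0-9_]{1,64}", re.MULTILINE)

passwd = (
    "_installassistant:*:25:25:Install Assistant:/var/empty:/usr/bin/false\n"
    "_lp:*:26:26:Printing Services:/var/spool/cups:/usr/bin/false"
)

usernames = username_pattern.findall(passwd)
```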


~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ TASK 7 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The objective for this task is to modify an already existing regex.

The user that wrote this regex in the past only wanted to match email addresses that ended with ’.com’ and ’.net’, but now it has to be extended to also match ’.se’ addresses.

The last thing to modify is the constraint on the domain name: it now has to be at least 4 characters long for the pattern to be valid.

"Literal text" "VARCHAR" "ANY" "GROUP"

bob@icloud.com Lily@sbcglobal.net mark@live.com

Name: Bob Email: bob@icloud.com Name: Lily Email: Lily@sbcglobal.net Name: Golum Email: golum@aol.com Name: Mark Email: mark@live.com

~~~~ Starting point Regify ~~~~

GROUP(
    VARCHAR("A-Za-z0-9_", 1, MORE), # Username
    @"@" VARCHAR("A-Za-z", 1, MORE), # Email domain name
    ANY(
        @".com",
        @".net"
    )


~~~~ Starting point RegEx ~~~~

(?:[A-Za-z0-9_]{1,}@[A-Za-z]{1,}(?:\.com|\.net))
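To illustrate the two requested changes, the sketch below extends the starting point regex in Python: '.se' is added to the alternation and the minimum domain name length is raised from 1 to 4. This is one possible solution written for illustration, not necessarily what the participants produced.

```python
import re

# Starting point regex from the task.
original = re.compile(r"(?:[A-Za-z0-9_]{1,}@[A-Za-z]{1,}(?:\.com|\.net))")

# Extended regex: '.se' added, domain name must be at least 4 characters.
extended = re.compile(r"(?:[A-Za-z0-9_]{1,}@[A-Za-z]{4,}(?:\.com|\.net|\.se))")

data = (
    "Name: Bob Email: bob@icloud.com Name: Lily Email: Lily@sbcglobal.net "
    "Name: Golum Email: golum@aol.com Name: Mark Email: mark@live.com"
)

matches_before = original.findall(data)
matches_after = extended.findall(data)
```

With the extended pattern, golum@aol.com is no longer matched because its domain name is only three characters long, which agrees with the example of matched patterns above.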

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ TASK 8 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The regular expression provided currently filters on specific URL domains, which now have to be extended.

You will have to add the domain names ’pixabay’ and ’twitter’ to complete this task.

It also filters out the <span> HTML tags, and now you will have to add support to filter out the <h2> tags as well.

"Literal text" "VARCHAR" "ANY" "GROUP" "UNTIL"

INLINE has to be used in this task in order to match everything except a single quote, since this isn’t supported in Regify at this state of development.

<h2>You guys think that a website doesn’t need some styling?</h2> <h2>The best kick-ass website</h2>

<h2>Well guess what, motherfucker:</h2> href=’https://imgur.com/gallery/u8asnn3’

<h1>This is the <i>best</i> -BEEP- website.</h1> <p class=’st’>Really, it is.</p>

<h2>You guys think that a website doesn’t need some styling?</h2> <p>You probably build websites using vim and feeling hardcore.<p> your 4.99KB <span class=’mfw’></span>

<p><span class=’wr’>WRONG</span>, .</p> <h2>The best website</h2>

<p>Let me describe the <i>real</i> perfect website <span class=’mfw’>websites</span>:</p>

<ul>

<li>This doesn’t weigh a ton (in fact it’s just 34.97 KB when the 27.83KB cat picture below is removed)</li>


<li>The page weighs exactly 63.02kB, 93.7% less than the <a href=’https://google.com/’>Google home page</a></li>

<li>Fits on your iPhone 1st gen (although it doesn’t work on your 16x32 Tamagotchi)</li>

GROUP( # Pattern for URLS with specific domain names
    @"href=’https://", # Will only match HTTPS websites
    ANY( # Domain names to match
        @"github",
        @"imgur"
    ),
    INLINE @"[^’]*", # Match anything until a single
                     # quote has been encountered
    @"’"
)
OR
GROUP( # match all <span> HTML tags
    @"<span>",
    UNTIL,
    @"</span>"
)

(?:href=’https:\/\/(?:github|imgur)[^’]*’)|(?:<span>.*<\/span>)

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ TASK 9 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In this task there are three patterns to be matched, which are surrounded by noise. Change the regular expression to also match the name ’Tom’, and the static pattern ’[<-\]’.

"Literal text" "VARCHAR" "ANY" "GROUP" "REPEAT"


[/Tom] .::,:: [<-\] .:;:,:: [/->]

.::,::phiVOIDWHITE"$$$ 3.14VOIDfubar3.1415Wed$$$Green WHITEfooBLACK[/Tom]fubar .::,::TueNULL$‘’fubar0[<-\]00100NILLWHITE$ 001010taoWHITE.:;:,::‘Nov011010[/->]3.14 3.1415$$110101pi$$$Greentao Tue$$$2.8

ANY( # Can be any of these three patterns
    REPEAT(2,
        GROUP(
            ANY( # Starts with comma or dot
                @",",
                @"."
            ), # Second char is always colon
            VARCHAR(".;:", 2, 3)
        )
    ),
    GROUP( # Name pattern
        @"[", # Opens with bracket
        VARCHAR("\/", 1), # Can be forward or backslash
        ANY( # Any of these names
            @"Edna", @"George", @"Phil", @"Harry"
        ),
        @"]" # Ends with bracket
    ),
    @"[/->]" # Static pattern
)

(?:(?:(?:(?:,|\.)[\.;:]{2,3})(?:(?:,|\.)[\.;:]{2,3}))|(?:\[[\\\/]{1}(?:Edna|George|Phil|Harry)\])|(?:(?:\[\/->\])))
