-Joseph Beberman
*Some slides are inspired by a PowerPoint presentation used by Professor Seikyung Jung, which was derived from one by Charlie Wiseman.

Regular expressions are all over the place. Most flavors share nearly identical syntax, but for what it’s worth I will be using the syntax tied to Unix systems.

In computer science, regular expressions are used to locate strings based on a pattern.
Search for every email address in a file? Regular expressions make it easy.
Often referred to as regexp, regex, etc.

For instance: a phone number is three digits, followed by a dash, three digits, a dash, and then four digits (“555-555-5555”).
You can make a regular expression which matches all phone numbers by indicating that pattern.
For reference: [0-9]{3}-[0-9]{3}-[0-9]{4}

Consider values of currency. How could you describe in English any/all monetary values (including cents) with a single pattern? e.g. $50.25
Dollar sign, one or more digits, period, two digits.
For reference: \$[0-9]+\.[0-9]{2}

grep: I am using it to showcase regular expressions.
◦ The name comes from the ed editor command g/re/p: globally search for a regular expression and print.
A command available on most if not all Unix-like systems.
Seems to be an incredibly popular command among system administrators.

grep is used to do text-based searching, generally on the Linux command line or in scripting.
Takes two arguments.
Generic format: grep STRING FILE
It prints every line of FILE that has STRING in it.
Example: grep root /etc/passwd
◦ Prints out all lines in the /etc/passwd file that contain the string "root"

The contents of /etc/passwd

grep 'root' /etc/passwd
What does this tell you?
◦ 2 lines contain the string “root”
◦ Highlights exactly where the string was matched

grep has a number of options, and even though it’s off topic, knowing some may help you understand the power of grep/regexps.
-i » ignore case
-v » negation
◦ grep -v hello filename.txt
Would return every line of filename.txt that does not contain the string hello.

How does grep use regular expressions?
◦ Again: grep is named for the ed command g/re/p (globally search for a regular expression and print)
Recall the format: grep STRING FILE
The STRING is actually interpreted as a regular expression.
Note: I will be using the -E option for grep
◦ Don’t worry about it; it enables extended regular expressions, so the special characters shown below work without extra backslashes.

First things first… we need a text file to search!
I’ve taken the time to make a simple text file which will help me show some simpler regular expressions.

How do you up your game from literal strings like “root” to creating patterns?
Regexes have their own syntax.
To start: parentheses are used for grouping “or statements”.
To match one thing or something else, you group them in parentheses and separate them with pipes.
◦ (joe|Joe) will match the string “joe” or “Joe”
◦ (hello|goodbye|sup) matches “hello”, “goodbye”, or “sup”

You can specify a range of characters within brackets.
◦ For example [a-z] will match any lower case letter.
◦ [A-Z] any upper case letter
◦ [0-9] any digit

Now the pattern is any digit: [0-9].
Now the pattern is digits 0 to 5: [0-5].

You can match one thing after another.
◦ For example: [a-z][0-9] will match any lower case letter followed by a digit.
Now we are starting to see patterns!

When specifying one range or another, you don’t need a pipe.
◦ For example [a-zA-Z] will match any lower or upper case letter.
◦ [0-9a-zA-Z] will match any alphanumeric character

Now it’s time to get more specific. What if you want to find something that occurs multiple times in a row?
The +, *, ?, and {} special characters specify how many times you want the pattern directly in front of them to occur.
◦ Ex. [a-zA-Z]+
◦ The + modifies the grouping in front of it

+ » one or more instances
◦ [a-zA-Z]+ would match any string of lower/upper case letters at least 1 letter long.
* » zero or more instances
◦ [0-9]* would match any number of digits, or none at all.
? » zero or one instance (aka optional)
◦ [a-zA-Z]? would match a single letter or none at all.

[a-z]+[0-9]*[A-Z]?
◦ ade7E
◦ cpB
◦ F12CP X (no match: it contains no lower case letter for [a-z]+)
◦ Please ask questions here if you’re confused!

{} » specific count or range
◦ {3} or {4,7}
◦ '[0-9]{3}-[0-9]{3}-[0-9]{4}' for a phone number

Now we can make a regular expression that matches emails! Let’s try now…
Any alphanumeric sequence, @, any alphabetical sequence, ., any lower case sequence
Building it up one piece at a time:
◦ '[a-zA-Z0-9]+'
◦ '[a-zA-Z0-9]+@'
◦ '[a-zA-Z0-9]+@[a-zA-Z]+'
◦ '[a-zA-Z0-9]+@[a-zA-Z]+.'
◦ '[a-zA-Z0-9]+@[a-zA-Z]+.[a-z]+'

Weird… why did we match that third line?
. is a special character which matches any single character.
That means 't.o' would match two, too, t2o, or many other things.
That’s how it matched below. The . matched 0!
So how do we avoid matching weird things like j03b130@h0tma?
◦ '[a-zA-Z0-9]+@[a-zA-Z]+\.[a-z]+'
You can escape special characters by putting \ in front of them.
◦ So \. means a literal period.
◦ Note: Escape \ by putting \ in front of it! So \\ means a literal backslash.
◦ Double Note: in GNU grep, \s matches a whitespace character such as a space. Here the backslash does the opposite of escaping: it gives the ordinary character s a special meaning.

^ » indicates the start of a line
Notice how it didn’t match every line with “I” in it, only the ones which start with I.
Vs.
$ » indicates the end of a line

Syntax:
◦ ^ start of line
◦ $ end of line
◦ + one or more
◦ * zero or more
◦ ? zero or one
◦ . matches any single character
◦ {n} n times
◦ {n,m} n to m times
◦ (string1|string2) matches string1 or string2

What does this match? [0-9]{3}-[0-9]{3}-[0-9]{4}
Phone numbers!

What does this match? \$[0-9]+\.[0-9]{2}
Money values

Example: What does this match?
'((25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)'
That actually matches valid IPv4 addresses: four numbers from 0 to 255 separated by periods. (I found it online though. Credit to SASIKALA of thegeekstuff.com)

Regular expressions simply indicate a pattern.
What is important is that the pattern can be searched for as opposed to a literal string.
That means instead of searching for one specific phone number, you can search for any phone number with ease by matching the pattern that all phone numbers follow.

Common tasks that regular expressions are used for:
Searching: Find strings that match a given pattern.
◦ Ctrl-F, anyone? There are tools to add regular expression functionality to Ctrl-F, at least on Chrome.
◦ Tool: Regular Expression Searcher
Once you find said strings based on the pattern, there is a lot you can do with those matches.
Substitution: Replace all matching strings.
◦ Ctrl-H (in Word), anyone?
Splitting: Split strings based upon matches.
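
To make the three tasks above concrete, here is a minimal sketch in Python rather than grep. That choice is my assumption; the slides themselves only use grep. The sample text and the variable names are made up for illustration. The patterns are the phone, money, and email expressions from earlier slides, which Python's re module happens to accept unchanged.

```python
import re

# Patterns from the slides (the email one uses the escaped period).
PHONE = r"[0-9]{3}-[0-9]{3}-[0-9]{4}"
MONEY = r"\$[0-9]+\.[0-9]{2}"
EMAIL = r"[a-zA-Z0-9]+@[a-zA-Z]+\.[a-z]+"

# Made-up sample text, only for illustration.
text = "Call 555-555-5555 or write to joe123@example.com about the $50.25 fee."

# Searching: collect every substring that matches a pattern.
phones = re.findall(PHONE, text)
amounts = re.findall(MONEY, text)
emails = re.findall(EMAIL, text)

# Substitution: replace every match with a placeholder.
redacted = re.sub(PHONE, "XXX-XXX-XXXX", text)

# Splitting: cut a string wherever the pattern matches
# (here the pattern is a comma followed by zero or more spaces).
fields = re.split(r", *", "root, joe,  admin")

print(phones)
print(amounts)
print(emails)
print(redacted)
print(fields)
```

Note that re.findall searches anywhere in the string, much like grep searches anywhere in a line; add ^ and $ to the pattern if the whole string has to match.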