
Introduction to Regular Expression | Regular Expression Using Python

String Operations

In Python, a string is a sequence of characters. A text file is essentially one large string that includes white space, line breaks and any special characters in the file. When processing strings and files, we often need to look for patterns and extract the information that is relevant to our use.


For example, we use the find() method to look for the index location of a sub-string in a string variable:


mystring = 'But soft what light through yonder window breaks'
substring = 'soft'
print(mystring.find(substring))
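As a small sketch of how find() behaves (the variable names found_at and missing_at are illustrative, not from the original), it returns the index of the first occurrence, or -1 when the sub-string is absent:

```python
mystring = 'But soft what light through yonder window breaks'

# find() returns the index of the first occurrence of the sub-string
found_at = mystring.find('soft')

# find() returns -1 when the sub-string does not occur at all
missing_at = mystring.find('loud')

print(found_at)
print(missing_at)
```

Checking for -1 before slicing avoids silently extracting the wrong portion of the string.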

We can use string slicing to extract a portion of a string, as in the following examples:

txt = '0123456789'
print(txt[:5])
print(txt[3:])

output:

01234

3456789


Combining both techniques, we then look for certain patterns in a string to extract (or parse) out relevant information, like this example:


mystr = 'Extract the number after colon :0.8475'
# slice from the character just after the colon to the end of the string
number = mystr[mystr.find(':')+1:]
print(float(number))

output:

0.8475
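The same find-then-slice idea can be wrapped in a small helper. This is only a sketch: the function name number_after_colon is hypothetical, and it assumes everything after the first colon is a valid number.

```python
def number_after_colon(text):
    """Return the number after the first colon, or None if there is no colon."""
    pos = text.find(':')
    if pos == -1:
        # Without this check, text[pos + 1:] would be the whole string
        return None
    return float(text[pos + 1:])


print(number_after_colon('Extract the number after colon :0.8475'))
print(number_after_colon('No colon in this line'))
```

The guard matters because find() returns -1 on failure, and slicing from index 0 would then pass the entire line to float().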


String Formatting

Formatting strings lets you specify a template and insert variables into it. It can also control how values are represented, e.g. showing a number with two decimal places or as a percentage.


We will cover some examples here briefly. A good reference site for Python string formatting can be found here:


https://pyformat.info/ 

Formatting a string basically means writing a string with special placeholders and calling the format() method on it to insert values into those placeholders.


a = 'Value of variable A'
b = 123
print('VAR a is {} and VAR B is {}'.format(a, b))

output:

VAR a is Value of variable A and VAR B is 123


s = 'VAR a is {a} and VAR B is {b}'
print(s.format(a='Value of variable A', b=123))

output:

VAR a is Value of variable A and VAR B is 123
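Placeholders can also carry a positional index, which makes it possible to reorder or repeat arguments. A minimal sketch (the variable names swapped and repeated are illustrative):

```python
a = 'Value of variable A'
b = 123

# {0} is the first argument to format(), {1} is the second
swapped = 'VAR B is {1} and VAR a is {0}'.format(a, b)

# The same index may be used more than once
repeated = '{0}, {0}, {1}'.format('x', 'y')

print(swapped)
print(repeated)
```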


Number Formatting

We can add comma separators for thousands when we format a number:


print('This is a large number {}'.format(2719249123))
print('This is a large number {:,}'.format(2719249123))

output:

This is a large number 2719249123

This is a large number 2,719,249,123


We can also perform a calculation and format the result directly to a given precision, or convert it into a percentage:


# The 0 in {0:.4} refers to the first argument (index 0) of format()
# The .4 sets the precision; with no type code that means 4 significant digits
print('7 divided by 67890 is {0:.4}'.format(7/67890.0))

# With %, the value is multiplied by 100 and shown with 4 digits after the decimal point
print('7 divided by 67890 is {0:.4%}'.format(7/67890.0))

output:

7 divided by 67890 is 0.0001031

7 divided by 67890 is 0.0103%
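The output above shows that a bare .4 counts significant digits rather than decimal places. The following sketch compares it with the f and % type codes (the variable names are illustrative):

```python
value = 7 / 67890.0

general = '{:.4}'.format(value)    # 4 significant digits
fixed = '{:.4f}'.format(value)     # 4 digits after the decimal point
percent = '{:.4%}'.format(value)   # times 100, 4 digits after the decimal point

print(general)
print(fixed)
print(percent)
```

This prints 0.0001031, 0.0001 and 0.0103% respectively, so the f type code is the one to use when a fixed number of decimal places is wanted.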


Regular Expressions

Because searching for text patterns is such a common programming task, Python provides a regular expressions library, re, to handle it. In this course, we cover the basics of regular expressions. More details are available in the official Python documentation for the re module.


The regular expressions library re must be imported before we can use it. The example below demonstrates the use of the search() function. First, download the data file from https://www.py4e.com/code3/mbox-short.txt and save it in the same directory where you run the Python program. The mbox-short.txt file looks like this:


From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008

Return-Path: <postmaster@collab.sakaiproject.org>

Received: from murder (mail.umich.edu [141.211.14.90])

by frankenstein.mail.umich.edu (Cyrus v2.3.8) with LMTPA;

Sat, 05 Jan 2008 09:14:16 -0500

X-Sieve: CMU Sieve 2.3

Received: from murder ([unix socket])

by mail.umich.edu (Cyrus v2.2.12) with LMTPA;

Sat, 05 Jan 2008 09:14:16 -0500

Received: from holes.mr.itd.umich.edu (holes.mr.itd.umich.edu

[141.211.14.79])

by flawless.mail.umich.edu () with ESMTP id m05EEFR1013674;

Sat, 5 Jan 2008 09:14:15 -0500

Received: FROM paploo.uhi.ac.uk (app1.prod.collab.uhi.ac.uk

[194.35.219.184])

BY holes.mr.itd.umich.edu ID 477F90B0.2DB2F.12494 ;

5 Jan 2008 09:14:10 -0500

Received: from paploo.uhi.ac.uk (localhost [127.0.0.1])

by paploo.uhi.ac.uk (Postfix) with ESMTP id 5F919BC2F2;

Sat, 5 Jan 2008 14:10:05 +0000 (GMT)

Message-ID:

<200801051412.m05ECIaH010327@nakamura.uits.iupui.edu>

Mime-Version:


Run the following program to search for all lines that contain the text ‘From:’:

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # print only the lines that contain 'From:' anywhere
    if re.search('From:', line):
        print(line)

The power of regular expressions lies in the ability to add special characters to the search string, so that we can precisely control what is matched. A skilful programmer can write regular expressions that achieve sophisticated matching and extraction with very little code.


For example, the caret (^) character is used in regular expressions to match the beginning of a line. By changing the search string from 'From:' to '^From:', our program will now match only lines where “From:” is at the beginning of the line:

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # ^ anchors the match to the start of the line
    if re.search('^From:', line):
        print(line)

The example here, while trivial (since we can use the find() or startswith() methods to do pretty much the same thing), serves to illustrate the idea that regular expressions use special characters to control precisely the required match patterns.
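To see that the anchored pattern and startswith() agree, here is a sketch that runs both on a few in-memory lines. The sample lines are illustrative data written in the style of mbox-short.txt, not taken from the file.

```python
import re

# Illustrative lines; only the first one starts with 'From:'
lines = [
    'From: stephen.marquard@uct.ac.za',
    'X-Note: copied From: another list',
    'Subject: hello',
]

with_regex = [line for line in lines if re.search('^From:', line)]
with_startswith = [line for line in lines if line.startswith('From:')]

print(with_regex)
print(with_startswith)
```

Both lists contain only the first line. The second line contains 'From:' but not at the start, so the anchored pattern rejects it.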


Regular Expressions Special Characters

The Regex Cheat Sheet at http://www.rexegg.com/regex-quickstart.html provides a good reference for most regular expression syntax. Here is a partial listing:


Character  Legend                                              Example    Sample Match
\d         one digit from 0 to 9                               \d\d       25
\w         letter, digit or underscore                         \w\w\w     B_1
\s         space, tab, newline, carriage return, vertical tab  a\sb\sc    a b c
\S         one character that is not whitespace                \S\S\S\S   a3d_
.          any character except a line break                   x.y        xry or x1y
\          Escapes a special character                         \.\{\}     .{}
+          One or more                                         \w--\w+    A--b1_1
{3}        Exactly three times                                 \w{3}      Aa_
*          Zero or more times                                  A*B*C*     AAACCC
?          Once or none                                        plurals?   plural
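The rows of the table can be checked directly in Python. The sketch below pairs each example pattern with its sample match and uses re.fullmatch(), which succeeds only when the whole sample matches the pattern (the names pairs and results are illustrative).

```python
import re

# (pattern, sample) pairs taken from the table above
pairs = [
    (r'\d\d', '25'),
    (r'\w\w\w', 'B_1'),
    (r'a\sb\sc', 'a b c'),
    (r'\S\S\S\S', 'a3d_'),
    (r'x.y', 'xry'),
    (r'\.\{\}', '.{}'),
    (r'\w--\w+', 'A--b1_1'),
    (r'\w{3}', 'Aa_'),
    (r'A*B*C*', 'AAACCC'),
    (r'plurals?', 'plural'),
]

# True for each pair where the entire sample matches the pattern
results = [re.fullmatch(pattern, sample) is not None for pattern, sample in pairs]

print(all(results))
```

The patterns are written as raw strings (the r prefix) so that Python passes the backslashes to the re module unchanged.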


Using this reference, we can now examine the matching patterns of the following expressions:


Expression   Matches
F..m:        From:  Favm:  F_!m:
From:.+@     From:123@  From:a@  From:BOB@  From:uct.ac@
\S+@\S+      ab@yx.com  123@34.56.com
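It is just as instructive to see what these expressions reject. In the sketch below, the strings in the rejected list are made-up counter-examples (not from the original), each violating a "one or more" or "exactly one character" requirement.

```python
import re

# Strings listed as matches in the text above
accepted = [
    (r'F..m:', 'F_!m:'),
    (r'From:.+@', 'From:uct.ac@'),
    (r'\S+@\S+', 'ab@yx.com'),
]

# Made-up counter-examples that should not match
rejected = [
    (r'F..m:', 'Frm:'),        # only one character between F and m
    (r'From:.+@', 'From:@'),   # nothing between the colon and the @
    (r'\S+@\S+', '@yx.com'),   # nothing before the @
]

accepted_ok = [re.search(p, s) is not None for p, s in accepted]
rejected_ok = [re.search(p, s) is None for p, s in rejected]

print(accepted_ok)
print(rejected_ok)
```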


“\S+@\S+” here means one or more non-whitespace characters, followed by “@”, followed by one or more non-whitespace characters. We can now use this last regular expression to search for email addresses in our sample text. Let us run the following program:

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # raw string (r'...') keeps the backslashes intact for the re module
    x = re.findall(r'\S+@\S+', line)
    if len(x) > 0:
        print(x)

The findall() function returns a list of all matches found in the line. When the list contains at least one element, we print it to see the values.


When we run the program, we see that some of the extracted email addresses carry unwanted characters such as “<” or “;” at the beginning or end. As with most real-world problems, this means we need to refine our parsing method to get results that are useful for our purpose.


In this case, we need to modify the search pattern so that the first and the last character of the email string must each be an upper or lower case letter, or a digit. In regular expressions, the character class “[a-zA-Z0-9]” serves this purpose. The revised regular expression is:


[a-zA-Z0-9]\S*@\S*[a-zA-Z0-9] 

Notice that in the expression above, we have switched from “+” to “*”, so “\S*” matches zero or more non-whitespace characters. Had we kept “+”, the expression would require at least two characters before and after the “@” character, which differs from the search pattern of the original regular expression (at least one character before and after “@”).
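The effect of the switch can be seen on a short test string. In this sketch, the text and the variable names are illustrative; plus_pattern is the variant that keeps “+” and is shown only for comparison.

```python
import re

star_pattern = r'[a-zA-Z0-9]\S*@\S*[a-zA-Z0-9]'
plus_pattern = r'[a-zA-Z0-9]\S+@\S+[a-zA-Z0-9]'

# 'a@b' has one character on each side of the @; 'ab@cd' has two
text = 'short a@b and longer <ab@cd>;'

star_matches = re.findall(star_pattern, text)
plus_matches = re.findall(plus_pattern, text)

print(star_matches)
print(plus_matches)
```

The “*” version finds both a@b and ab@cd and drops the surrounding “<” and “>;”, while the “+” version misses a@b because it needs at least two characters on each side.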


The code with the revised regular expression should produce output as follows:


['stephen.marquard@uct.ac.za']
['postmaster@collab.sakaiproject.org']
['200801051412.m05ECIaH010327@nakamura.uits.iupui.edu']
['source@collab.sakaiproject.org']

If we want to extract just the domain name of the email addresses, we can simply add parentheses to the regular expression, like this: '[a-zA-Z0-9]\S*@(\S*[a-zA-Z0-9])'


When used with findall(), the parentheses will indicate the portion of the searched items that we want to return as data.


If we run the following code:

import re
hand = open('mbox-short.txt')
for line in hand:
    line = line.rstrip()
    # the parentheses mark the part of each match that findall() returns
    x = re.findall(r'[a-zA-Z0-9]\S*@(\S*[a-zA-Z0-9])', line)
    if len(x) > 0:
        print(x)

The output will be the list of domain names extracted from the emails.
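The same extraction can be tried without the data file. The sketch below applies the pattern to three lines copied from the sample shown earlier and collects the captured domains in one list (the names lines, pattern and domains are illustrative).

```python
import re

# Lines copied from the mbox-short.txt sample shown earlier
lines = [
    'From stephen.marquard@uct.ac.za Sat Jan 5 09:14:16 2008',
    'Return-Path: <postmaster@collab.sakaiproject.org>',
    'X-Sieve: CMU Sieve 2.3',
]

pattern = r'[a-zA-Z0-9]\S*@(\S*[a-zA-Z0-9])'

domains = []
for line in lines:
    # findall() returns only the text captured by the parentheses
    domains.extend(re.findall(pattern, line))

print(domains)
```

The third line contains no “@”, so it contributes nothing to the list.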



Regular Expressions - Escape Characters

While special characters are used in regular expressions to match the beginning of a line or to specify wildcards, there will be instances when we need to match these characters literally (e.g. a dollar sign) in the string.


We can indicate this by prefixing that character with a backslash. This will indicate that the character has to be matched as part of the string pattern, as shown by the following example:

import re
x = 'We just received $10.00 for cookies.'
# \$ matches a literal dollar sign; [0-9.]+ matches the digits and dots after it
y = re.findall(r'\$[0-9.]+', x)
print(y)

“\$” means that “$” is treated as a literal character to be matched, here as the first character of the pattern, rather than as the end-of-line anchor.
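A short sketch makes the difference visible by running the pattern with and without the backslash. It also uses re.escape(), a function in the re module that adds the backslashes for us. The variable names are illustrative.

```python
import re

x = 'We just received $10.00 for cookies.'

# Escaped: the dollar sign is matched literally
escaped = re.findall(r'\$[0-9.]+', x)

# Unescaped: "$" anchors to the end of the string, so nothing can follow it
unescaped = re.findall(r'$[0-9.]+', x)

# re.escape() escapes every special character in a literal string
literal = re.findall(re.escape('$10.00'), x)

print(escaped)
print(unescaped)
print(literal)
```

The escaped pattern and the re.escape() version both find $10.00, while the unescaped pattern returns an empty list.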




For help with regular expression programming, you can contact us or send your requirement details to:


realcode4you@gmail.com