Introduction
Welcome to PythonSage! In this post,
we'll dive deep into the world of regular expressions (regex) in Python.
Regular expressions are a powerful tool for string manipulation and data
extraction. Whether you're a beginner or looking to sharpen your skills, this
guide covers the core syntax, the main functions of the re module, and several practical examples.
Introduction to Regular Expressions
Regular expressions are sequences of
characters that define search patterns. They are commonly used for string
matching, validation, and text processing. In Python, the ‘re’ module provides
support for working with regular expressions.
Basics of Regular Expressions
Let's start with some basic concepts and
syntax of regular expressions.
Metacharacters
Metacharacters are special characters
that have a unique meaning in regex:
.: Matches any character except a newline.
^: Matches the start of the string.
$: Matches the end of the string.
*: Matches 0 or more repetitions of the preceding pattern.
+: Matches 1 or more repetitions of the preceding pattern.
?: Matches 0 or 1 repetition of the preceding pattern.
{}: Matches a specific number of repetitions of the preceding pattern.
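To see the metacharacters above in action, here is a small sketch that runs each one through re.findall(). The sample strings are made up for illustration; they are not from any real dataset.

```python
import re

# Each entry: (pattern, sample text, what re.findall() returns for it)
samples = [
    (r'c.t', 'cat cut c\nt', ['cat', 'cut']),           # . never matches the newline
    (r'^The', 'The end', ['The']),                      # ^ anchors at the start
    (r'end$', 'The end', ['end']),                      # $ anchors at the end
    (r'ab*', 'a ab abbb', ['a', 'ab', 'abbb']),         # * allows zero or more b's
    (r'ab+', 'a ab abbb', ['ab', 'abbb']),              # + requires at least one b
    (r'colou?r', 'color colour', ['color', 'colour']),  # ? makes the u optional
    (r'\d{3}', '12 345 6789', ['345', '678']),          # {3} means exactly three digits
]

for pattern, text, expected in samples:
    print(pattern, '->', re.findall(pattern, text))
```

Note how \d{3} picks "678" out of "6789": findall() returns non-overlapping matches, so the leftover "9" is too short to match again.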
Character Classes
Character classes allow you to define a
set of characters to match:
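[abc]: Matches any one of the characters a, b, or c.
[a-z]: Matches any lowercase ASCII letter.
[A-Z]: Matches any uppercase ASCII letter.
[0-9]: Matches any ASCII digit.
\d: Matches any digit (equivalent to [0-9] when the re.ASCII flag is set; by default, for str patterns, it also matches other Unicode decimal digits).
\D: Matches any non-digit.
\w: Matches any word character (equivalent to [a-zA-Z0-9_] when the re.ASCII flag is set; by default, for str patterns, it also matches Unicode letters and digits).
\W: Matches any non-word character.
\s: Matches any whitespace character.
\S: Matches any non-whitespace character.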
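The character classes above can be tried out on a single string. This is a minimal sketch; the sample text is invented for illustration. The last two lines also show that, for str patterns in Python 3, \d is Unicode-aware unless you pass re.ASCII.

```python
import re

# Sample text invented for illustration.
text = 'Order #A-42 shipped on 2024-12-25 to user_7'

digits = re.findall(r'\d+', text)      # runs of digits
words = re.findall(r'\w+', text)       # runs of letters, digits, and underscores
capitals = re.findall(r'[A-Z]', text)  # single uppercase ASCII letters
chunks = re.findall(r'\S+', text)      # runs of non-whitespace characters

print(digits)    # ['42', '2024', '12', '25', '7']
print(words)     # ['Order', 'A', '42', 'shipped', 'on', '2024', '12', '25', 'to', 'user_7']
print(capitals)  # ['O', 'A']
print(chunks)    # ['Order', '#A-42', 'shipped', 'on', '2024-12-25', 'to', 'user_7']

# \d is Unicode-aware for str patterns; re.ASCII narrows it to [0-9].
mixed = '3\u0663'  # ASCII digit 3 followed by ARABIC-INDIC DIGIT THREE
unicode_count = len(re.findall(r'\d', mixed))
ascii_count = len(re.findall(r'\d', mixed, re.ASCII))
print(unicode_count, ascii_count)  # 2 1
```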
Anchors
Anchors are used to specify the position within
a string:
\b: Matches a word boundary.
\B: Matches a non-word boundary.
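Word boundaries are easiest to understand by looking at where a pattern is allowed to start. The sketch below uses an invented sample string and a small helper, starts(), which is just a convenience for this example and not part of the re module.

```python
import re

# Sample text invented for illustration.
text = 'cat concat category cat.'

def starts(pattern):
    """Return the start index of every match of pattern in text."""
    return [m.start() for m in re.finditer(pattern, text)]

print(starts(r'cat'))      # [0, 7, 11, 20]  every occurrence
print(starts(r'\bcat\b'))  # [0, 20]         "cat" as a whole word
print(starts(r'\Bcat\b'))  # [7]             "cat" ending a longer word (concat)
print(starts(r'\bcat\B'))  # [11]            "cat" beginning a longer word (category)
```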
Groups and Alternation
Groups and alternation allow for more
complex patterns:
(abc): Matches the exact string "abc" and captures it as a group.
|: Acts as an OR operator (e.g., a|b matches "a" or "b").
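One thing that trips people up is how far the | operator reaches. Here is a short sketch, using an invented sample string, that compares alternation with and without parentheses.

```python
import re

# Sample text invented for illustration.
text = 'cat food, dog food'

# Without parentheses, | splits the entire pattern: "cat" OR "dog food".
ungrouped = re.findall(r'cat|dog food', text)
print(ungrouped)  # ['cat', 'dog food']

# Parentheses limit the alternation to "cat" or "dog"; " food" must follow either.
grouped = re.compile(r'(cat|dog) food')
full_matches = [m.group(0) for m in grouped.finditer(text)]
animals = [m.group(1) for m in grouped.finditer(text)]
print(full_matches)  # ['cat food', 'dog food']
print(animals)       # ['cat', 'dog']
```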
Using Regular Expressions in Python
Let's explore how to use regular
expressions in Python with the ‘re’ module.
Importing the ‘re’ Module
import re
Basic Functions
re.match()
The re.match() function checks for a match only at the beginning of the string.
import re
pattern = r'^hello'
text = 'hello world'
match = re.match(pattern, text)
if match:
    print("Match found!")
else:
    print("No match found.")
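Explanation: In this example, the pattern ^hello checks if the string starts with "hello". Because re.match() only ever matches at the beginning of the string, the ^ is redundant here, but it makes the intent explicit. The text "hello world" does start with "hello", so it prints "Match found!".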
re.search()
The re.search() function searches the
entire string for a match.
import re
pattern = r'world'
text = 'hello world'
match = re.search(pattern, text)
if match:
    print("Match found!")
else:
    print("No match found.")
Explanation: Here, the pattern world is searched throughout the entire string. Since "world" is found in "hello world", it prints "Match found!".
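The difference between re.match() and re.search() is easy to miss, so here is a side-by-side sketch on the same text. It also uses re.fullmatch(), a third function in the re module that succeeds only when the whole string matches.

```python
import re

text = 'hello world'

at_start = re.match(r'world', text)    # only tries position 0
anywhere = re.search(r'world', text)   # scans the whole string
partial = re.fullmatch(r'hello', text) # the pattern must cover the entire string
entire = re.fullmatch(r'hello world', text)

print(at_start)         # None
print(anywhere.span())  # (6, 11)
print(partial)          # None
print(entire.group())   # hello world
```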
re.findall()
The re.findall() function returns all
non-overlapping matches of the pattern in the string.
import re
pattern = r'\d+'
text = 'There are 12 apples and 24 oranges.'
matches = re.findall(pattern, text)
print(matches) # Output: ['12', '24']
Explanation: The pattern \d+
matches one or more digits. The function re.findall() finds all occurrences of
this pattern in the text, returning ['12', '24'].
re.sub()
The re.sub() function replaces the
matches with the specified replacement string.
import re
pattern = r'apple'
text = 'I have an apple and another apple.'
result = re.sub(pattern, 'orange', text)
print(result) # Output: 'I have an orange and another orange.'
Explanation: The pattern apple is replaced with "orange" in the text.
The function re.sub() performs this replacement, resulting in "I have an
orange and another orange.".
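re.sub() has a few more options worth knowing. The sketch below shows the count argument, the related re.subn() function, and a function used as the replacement. The helper name double is just a placeholder for this example.

```python
import re

text = 'I have an apple and another apple.'

# count limits how many matches are replaced (passed as a keyword argument).
first_only = re.sub(r'apple', 'orange', text, count=1)
print(first_only)  # I have an orange and another apple.

# re.subn() also reports how many replacements were made.
replaced, n = re.subn(r'apple', 'orange', text)
print(replaced, n)  # I have an orange and another orange. 2

# The replacement can be a function that receives each match object.
def double(match):
    return str(int(match.group()) * 2)

doubled = re.sub(r'\d+', double, 'There are 12 apples and 24 oranges.')
print(doubled)  # There are 24 apples and 48 oranges.
```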
Compiling Regular Expressions
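You can compile a regular expression once and reuse the resulting pattern object. This keeps
the code tidy and can be slightly faster when the same pattern is used many times (the re module also caches recently used patterns, so the speedup is often modest).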
import re
pattern = re.compile(r'\d+')
text = 'There are 12 apples and 24 oranges.'
matches = pattern.findall(text)
print(matches) # Output: ['12', '24']
Explanation: Compiling the pattern \d+ using re.compile() creates a regex
object. This object is then used to find all digit sequences in the text,
returning ['12', '24'].
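Compiling pays off most when one pattern is applied to many strings. Here is a minimal sketch of that usage; the list of lines and the name NUMBER are invented for illustration.

```python
import re

# Compile once, at module level, and reuse for every line.
NUMBER = re.compile(r'\d+')

# Sample lines invented for illustration.
lines = [
    'There are 12 apples and 24 oranges.',
    'No fruit today.',
    '3 pears, 4 plums, 5 figs',
]

# Sum the numbers found on each line.
totals = [sum(int(n) for n in NUMBER.findall(line)) for line in lines]
print(totals)  # [36, 0, 12]
```

A compiled pattern object offers the same methods as the module-level functions (match, search, findall, finditer, sub), just without the pattern argument.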
Using Groups
Groups allow you to extract specific
parts of the match.
import re
pattern = r'(\d+)\s(apples|oranges)'
text = 'There are 12 apples and 24 oranges.'
matches = re.findall(pattern, text)
print(matches) # Output: [('12', 'apples'), ('24', 'oranges')]
Explanation: The pattern (\d+)\s(apples|oranges) has two groups: (\d+) for digits and (apples|oranges) for either "apples" or "oranges". The function re.findall() returns a list of tuples, each containing the matched groups: [('12', 'apples'), ('24', 'oranges')].
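If you need the groups of a single match rather than a list of tuples, the match object returned by re.search() gives you direct access. A short sketch on the same text:

```python
import re

pattern = r'(\d+)\s(apples|oranges)'
text = 'There are 12 apples and 24 oranges.'

m = re.search(pattern, text)  # first match only

print(m.group(0))  # 12 apples   (the whole match)
print(m.group(1))  # 12          (first group)
print(m.group(2))  # apples      (second group)
print(m.groups())  # ('12', 'apples')
print(m.span(1))   # (10, 12)    (where the first group sits in the text)
```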
Using Named Groups
Named groups make the regex more readable
and allow you to access groups by name.
import re
pattern = r'(?P<number>\d+)\s(?P<fruit>apples|oranges)'
text = 'There are 12 apples and 24 oranges.'
matches = re.finditer(pattern, text)
for match in matches:
    print(match.group('number'), match.group('fruit'))
Explanation: Named groups (?P<number>\d+) and (?P<fruit>apples|oranges) are used to capture digits and fruit names. The re.finditer() function returns an iterator yielding match objects. Each match object allows accessing the groups by their names, printing 12 apples and 24 oranges.
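Named groups also make it easy to turn matches into dictionaries. This sketch uses groupdict() and the match['name'] shorthand; the variable names records and inventory are just for this example.

```python
import re

pattern = re.compile(r'(?P<number>\d+)\s(?P<fruit>apples|oranges)')
text = 'There are 12 apples and 24 oranges.'

# groupdict() returns all named groups of one match as a dict.
records = [m.groupdict() for m in pattern.finditer(text)]
print(records)
# [{'number': '12', 'fruit': 'apples'}, {'number': '24', 'fruit': 'oranges'}]

# m['name'] is shorthand for m.group('name').
inventory = {m['fruit']: int(m['number']) for m in pattern.finditer(text)}
print(inventory)  # {'apples': 12, 'oranges': 24}
```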
Advanced Regular Expressions
Lookahead and Lookbehind
Lookaheads and lookbehinds are zero-width
assertions that allow you to match a pattern only if it's followed or preceded
by another pattern.
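"Zero-width" means the assertion checks the surrounding text without consuming it. A quick way to see this is to compare match spans with and without a lookahead. The sample string is invented for illustration.

```python
import re

# Sample text invented for illustration.
text = 'price: 100 USD'

with_lookahead = re.search(r'\d+(?= USD)', text)  # " USD" is checked, not consumed
without_lookahead = re.search(r'\d+ USD', text)   # " USD" is part of the match

print(with_lookahead.group(), with_lookahead.span())        # 100 (7, 10)
print(without_lookahead.group(), without_lookahead.span())  # 100 USD (7, 14)
```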
Positive Lookahead
Matches if the specified pattern is
followed by another pattern.
import re
pattern = r'\d+(?= apples)'
text = 'There are 12 apples and 24 oranges.'
matches = re.findall(pattern, text)
print(matches) # Output: ['12']
Explanation: The pattern \d+(?= apples) matches digits only if they are followed by " apples". The re.findall() function returns ['12'], as only "12" is followed by " apples".
Negative Lookahead
Matches if the specified pattern is not
followed by another pattern.
import re
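pattern = r'\d+\b(?! apples)'
text = 'There are 12 apples and 24 oranges.'
matches = re.findall(pattern, text)
print(matches) # Output: ['24']
Explanation: The pattern \d+\b(?! apples) matches digits only if they are not followed by " apples". The \b matters: without it, the engine would backtrack from "12" to "1", which is followed by "2" rather than " apples", and re.findall() would return ['1', '24']. With the word boundary in place, re.findall() returns ['24'], as "24" is not followed by " apples".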
Positive Lookbehind
Matches if the specified pattern is
preceded by another pattern.
import re
pattern = r'(?<=There are )\d+'
text = 'There are 12 apples and 24 oranges.'
matches = re.findall(pattern, text)
print(matches) # Output: ['12']
Explanation: The pattern (?<=There are )\d+ matches digits only if they are preceded by "There are ". The re.findall() function returns ['12'], as only "12" is preceded by "There are ".
Negative Lookbehind
Matches if the specified pattern is not
preceded by another pattern.
import re
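pattern = r'(?<!There are )\b\d+'
text = 'There are 12 apples and 24 oranges.'
matches = re.findall(pattern, text)
print(matches) # Output: ['24']
Explanation: The pattern (?<!There are )\b\d+ matches digits only if they are not preceded by "There are ". The \b keeps a match from starting in the middle of a number: without it, the "2" in "12" would match because it is preceded by "1", and re.findall() would return ['2', '24']. With the word boundary in place, re.findall() returns ['24'], as "24" is not preceded by "There are ".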
Practical Examples
Email Validation
import re
pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
email = 'example@example.com'
if re.match(pattern, email):
    print("Valid email address.")
else:
    print("Invalid email address.")
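Explanation: The pattern ^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$
performs a basic format check on an email address. It requires one or more letters,
digits, or the characters _ . + - before the "@" symbol, followed by a domain
name, a dot, and a final part made of letters, digits, dots, or hyphens. It does not verify that the top-level domain or the address itself actually exists.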
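If you want to reuse this check, one option is to wrap it in a small function. The sketch below is only one way to do it: the names EMAIL_RE and is_valid_email are made up for this example, and it swaps re.match() with ^ and $ for re.fullmatch(). The reason for the swap is that $ also matches just before a trailing newline, so the original check would accept 'example@example.com\n'.

```python
import re

# Same character sets as the pattern above; fullmatch() replaces the ^ and $ anchors.
EMAIL_RE = re.compile(r'[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9.-]+')

def is_valid_email(address):
    """Return True if address passes this basic format check."""
    return EMAIL_RE.fullmatch(address) is not None

print(is_valid_email('example@example.com'))    # True
print(is_valid_email('example@example'))        # False (no dot after the domain)
print(is_valid_email('example@example.com\n'))  # False (trailing newline rejected)
```

This is still a format check only. For real sign-up flows, the usual approach is to send a confirmation email.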
Extracting Dates
import re
pattern = r'(\d{2})/(\d{2})/(\d{4})'
text = 'The event is on 25/12/2024.'
matches = re.findall(pattern, text)
print(matches) # Output: [('25', '12', '2024')]
Explanation: The pattern (\d{2})/(\d{2})/(\d{4}) captures dates in the format DD/MM/YYYY. The re.findall() function returns a list of tuples with the day, month, and year: [('25', '12', '2024')].
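The regex only checks the shape of a date, so something like 31/02/2024 would match too. A possible follow-up step, sketched below, is to hand the captured parts to datetime.date and skip anything it rejects. The function name extract_dates and the sample sentence are invented for this example.

```python
import re
from datetime import date

DATE_RE = re.compile(r'(\d{2})/(\d{2})/(\d{4})')

def extract_dates(text):
    """Return a list of date objects for every valid DD/MM/YYYY date in text."""
    results = []
    for day, month, year in DATE_RE.findall(text):
        try:
            results.append(date(int(year), int(month), int(day)))
        except ValueError:
            # The digits matched the pattern but are not a real calendar date.
            pass
    return results

found = extract_dates('The event is on 25/12/2024, not 31/02/2024.')
print(found)  # [datetime.date(2024, 12, 25)]
```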
Phone Number Formatting
import re
pattern = r'(\d{3})-(\d{3})-(\d{4})'
text = 'My phone number is 123-456-7890.'
formatted_number = re.sub(pattern, r'(\1) \2-\3', text)
print(formatted_number) # Output: 'My phone number is (123) 456-7890.'
Explanation: The pattern (\d{3})-(\d{3})-(\d{4}) captures phone numbers in the format XXX-XXX-XXXX. The re.sub() function replaces this pattern with the formatted string (\1) \2-\3, resulting in "(123) 456-7890".
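Numbered backreferences like \1 can get hard to read as patterns grow. Here is the same idea sketched with named groups and the \g<name> syntax in the replacement string. The group names (area, prefix, line) and the sample text are chosen just for this example.

```python
import re

pattern = re.compile(r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})')

# Sample text invented for illustration.
text = 'Call 123-456-7890 or 987-654-3210.'

# \g<name> refers to a named group inside the replacement string.
formatted = pattern.sub(r'(\g<area>) \g<prefix>-\g<line>', text)
print(formatted)  # Call (123) 456-7890 or (987) 654-3210.
```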
Conclusion
Regular expressions are a powerful tool
for text processing and data extraction. With the ‘re’ module in Python, you
can leverage the full potential of regex to perform complex string
manipulations. Practice with different patterns and use cases to become proficient
in using regular expressions.
External Links
Regular Expressions Documentation
Regular Expressions 101 (Interactive Tool)
For more tips, tutorials, and Python projects, stay tuned to PythonSage.
Happy coding!