Learn Regular Expressions in Python: A Comprehensive Guide for Beginners

Introduction

Welcome to PythonSage! In this post, we'll dive deep into the world of regular expressions (regex) in Python. Regular expressions are a powerful tool for string manipulation and data extraction. Whether you're a beginner or looking to sharpen your skills, this guide covers the core syntax and the ‘re’ functions you need to start using regex in Python with confidence.

Introduction to Regular Expressions

Regular expressions are sequences of characters that define search patterns. They are commonly used for string matching, validation, and text processing. In Python, the ‘re’ module provides support for working with regular expressions.

Basics of Regular Expressions

Let's start with some basic concepts and syntax of regular expressions.

Metacharacters

Metacharacters are special characters that have a unique meaning in regex:

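. : Matches any character except a newline (unless the re.DOTALL flag is set).
^ : Matches the start of the string (with re.MULTILINE, also the start of each line).
$ : Matches the end of the string or just before a trailing newline (with re.MULTILINE, also the end of each line).
* : Matches 0 or more repetitions of the preceding pattern.
+ : Matches 1 or more repetitions of the preceding pattern.
? : Matches 0 or 1 repetition of the preceding pattern.
{} : Matches a specific number of repetitions of the preceding pattern: {m} means exactly m, and {m,n} means from m to n.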
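
To see the quantifiers side by side, here is a small sketch. The sample strings are made up for illustration, and re.fullmatch() is used so that a pattern only counts when it matches the whole string.

```python
import re

# Hypothetical sample strings, chosen only to show each quantifier.
samples = ['ct', 'cat', 'caat', 'caaat']

star = [s for s in samples if re.fullmatch(r'ca*t', s)]        # 0 or more 'a'
plus = [s for s in samples if re.fullmatch(r'ca+t', s)]        # 1 or more 'a'
optional = [s for s in samples if re.fullmatch(r'ca?t', s)]    # 0 or 1 'a'
exact = [s for s in samples if re.fullmatch(r'ca{2}t', s)]     # exactly 2 'a'
ranged = [s for s in samples if re.fullmatch(r'ca{2,3}t', s)]  # 2 to 3 'a'

print(star)      # ['ct', 'cat', 'caat', 'caaat']
print(plus)      # ['cat', 'caat', 'caaat']
print(optional)  # ['ct', 'cat']
print(exact)     # ['caat']
print(ranged)    # ['caat', 'caaat']
```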

Character Classes

Character classes allow you to define a set of characters to match:

[abc] : Matches any one of the characters a, b, or c.
[a-z] : Matches any lowercase letter.
[A-Z] : Matches any uppercase letter.
[0-9] : Matches any digit.
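\d : Matches any digit. For str patterns this includes Unicode decimal digits; with the re.ASCII flag it is equivalent to [0-9].
\D : Matches any non-digit.
\w : Matches any word character. For str patterns this includes Unicode letters and digits plus the underscore; with the re.ASCII flag it is equivalent to [a-zA-Z0-9_].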
\W : Matches any non-word character.
\s : Matches any whitespace character.
\S : Matches any non-whitespace character.
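
A quick sketch of the shorthand classes in action. The sample text is hypothetical; the last part illustrates how \d behaves with and without the re.ASCII flag, using the Arabic-Indic digit three (written as the escape \u0663 to keep the source ASCII-only).

```python
import re

sample = 'user_1 paid $25.50'  # hypothetical sample text

words = re.findall(r'\w+', sample)
non_words = re.findall(r'\W', sample)
spaces = re.findall(r'\s', sample)

print(words)      # ['user_1', 'paid', '25', '50']
print(non_words)  # [' ', ' ', '$', '.']
print(spaces)     # [' ', ' ']

# \d on a str pattern also accepts non-ASCII decimal digits.
mixed = 'Room 42, floor \u0663'
unicode_digits = re.findall(r'\d', mixed)
ascii_digits = re.findall(r'\d', mixed, re.ASCII)

print(len(unicode_digits))  # 3
print(ascii_digits)         # ['4', '2']
```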

Anchors

Anchors are used to specify the position within a string:

\b : Matches a word boundary.
\B : Matches a non-word boundary.
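
Word boundaries are easiest to understand by comparison. In this minimal sketch (the sample sentence is invented), the same word is searched for with no anchors, with \b on both sides, and with \B on both sides.

```python
import re

text = 'cat concatenate cat.'  # hypothetical sample text

anywhere = re.findall(r'cat', text)        # every occurrence
whole_word = re.findall(r'\bcat\b', text)  # only standalone words
inside = re.findall(r'\Bcat\B', text)      # only inside a longer word

print(len(anywhere))    # 3
print(len(whole_word))  # 2
print(len(inside))      # 1
```

Note that the final "cat." still counts as a whole word, because the period is not a word character.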

Groups and Alternation

Groups and alternation allow for more complex patterns:

(abc) : Groups the pattern abc and captures the text it matches, so the group can be repeated as a unit or retrieved later.
| : Acts as an OR operator (e.g., a|b matches "a" or "b").
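
Here is a short sketch of how grouping changes what | applies to, and how a captured group can be read back. The example words are chosen only for illustration.

```python
import re

# Without a group, | splits the whole pattern: 'gray' OR 'grey'.
both = re.findall(r'gray|grey', 'gray grey')
print(both)  # ['gray', 'grey']

# With a group, | applies only inside the parentheses.
m = re.search(r'gr(a|e)y', 'The wall is grey')
print(m.group(0))  # grey  (the whole match)
print(m.group(1))  # e     (what the group captured)

# A group can also be repeated as a unit.
repeated = re.fullmatch(r'(ab)+', 'ababab') is not None
print(repeated)  # True
```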

Using Regular Expressions in Python

Let's explore how to use regular expressions in Python with the ‘re’ module.

Importing the ‘re’ Module

import re

Basic Functions

re.match()

The re.match() function checks for a match only at the beginning of the string.

import re

pattern = r'^hello'
text = 'hello world'
match = re.match(pattern, text)

if match:
    print("Match found!")
else:
    print("No match found.")

Explanation: In this example, the pattern ^hello checks whether the string starts with "hello". Since the text "hello world" does start with "hello", it prints "Match found!". Because re.match() already anchors at the beginning of the string, the ^ is optional here.
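
It can help to see what re.match() actually returns. This small sketch shows that the result is None when the pattern is not at the start, and a match object (with the matched text and its position) otherwise.

```python
import re

text = 'hello world'

miss = re.match(r'world', text)  # 'world' is not at the start
hit = re.match(r'hello', text)

print(miss)         # None
print(hit.group())  # hello
print(hit.span())   # (0, 5)
```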

re.search()

The re.search() function searches the entire string for a match.



import re

pattern = r'world'
text = 'hello world'
match = re.search(pattern, text)

if match:
    print("Match found!")
else:
    print("No match found.")

Explanation: Here, the pattern world is searched throughout the entire string. Since "world" is found in "hello world", it prints "Match found!".
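
One detail worth knowing is that re.search() stops at the first match. The sketch below uses a made-up sentence containing "world" twice and prints where the match was found.

```python
import re

text = 'hello world, wide world'  # hypothetical sample text

first = re.search(r'world', text)

print(first.group())               # world
print(first.start(), first.end())  # 6 11
```

To get every occurrence rather than just the first, use re.findall() or re.finditer().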

re.findall()

The re.findall() function returns all non-overlapping matches of the pattern in the string.



import re

pattern = r'\d+'
text = 'There are 12 apples and 24 oranges.'
matches = re.findall(pattern, text)

print(matches)  # Output: ['12', '24']

Explanation: The pattern \d+ matches one or more digits. The function re.findall() finds all occurrences of this pattern in the text, returning ['12', '24'].
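
"Non-overlapping" means that once some text has been used in a match, the search continues after it. A tiny sketch with an invented digit string:

```python
import re

# Each match consumes two digits, so '23' and '45' are never reported.
pairs = re.findall(r'\d\d', '12345')

print(pairs)  # ['12', '34']
```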

re.sub()

The re.sub() function replaces the matches with the specified replacement string.



import re

pattern = r'apple'
text = 'I have an apple and another apple.'
result = re.sub(pattern, 'orange', text)

print(result)  # Output: 'I have an orange and another orange.'

Explanation: The pattern apple is replaced with "orange" in the text. The function re.sub() performs this replacement, resulting in "I have an orange and another orange.".
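
re.sub() has a few options that are handy in practice. The sketch below shows the count argument, a function used as the replacement, and the related re.subn(); the helper name shout is just an illustrative choice.

```python
import re

text = 'I have an apple and another apple.'

# count=1 limits the number of replacements.
only_first = re.sub(r'apple', 'orange', text, count=1)
print(only_first)  # I have an orange and another apple.

# A function receives each match object and returns the replacement text.
def shout(match):
    return match.group().upper()

upper = re.sub(r'apple', shout, text)
print(upper)  # I have an APPLE and another APPLE.

# re.subn() also reports how many replacements were made.
replaced, how_many = re.subn(r'apple', 'orange', text)
print(how_many)  # 2
```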

Compiling Regular Expressions

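You can compile a regular expression with re.compile() to get a reusable pattern object, which keeps code tidy when the same pattern is used in several places. It can also save a little work on each call, although the ‘re’ module already caches recently used patterns, so the speed gain is usually small.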



import re

pattern = re.compile(r'\d+')
text = 'There are 12 apples and 24 oranges.'
matches = pattern.findall(text)

print(matches)  # Output: ['12', '24']

Explanation: Compiling the pattern \d+ using re.compile() creates a regex object. This object is then used to find all digit sequences in the text, returning ['12', '24'].
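
A compiled pattern is most useful when it is applied to many strings. Here is a minimal sketch that reuses one pattern object across a list of hypothetical input lines.

```python
import re

number = re.compile(r'\d+')

# Hypothetical input lines.
lines = ['12 apples', 'no fruit here', '24 oranges and 3 pears']

found = [number.findall(line) for line in lines]

print(found)           # [['12'], [], ['24', '3']]
print(number.pattern)  # \d+
```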

Using Groups

Groups allow you to extract specific parts of the match.



import re

pattern = r'(\d+)\s(apples|oranges)'
text = 'There are 12 apples and 24 oranges.'
matches = re.findall(pattern, text)

print(matches)  # Output: [('12', 'apples'), ('24', 'oranges')]

Explanation: The pattern (\d+)\s(apples|oranges) has two groups: (\d+) for digits and (apples|oranges) for either "apples" or "oranges". The function re.findall() returns a list of tuples, each containing the matched groups: [('12', 'apples'), ('24', 'oranges')].
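
When you work with a single match object instead of re.findall(), groups are read by number. A short sketch using the same pattern and text:

```python
import re

text = 'There are 12 apples and 24 oranges.'
m = re.search(r'(\d+)\s(apples|oranges)', text)

print(m.group(0))  # 12 apples  (the whole match)
print(m.group(1))  # 12         (first group)
print(m.group(2))  # apples     (second group)
print(m.groups())  # ('12', 'apples')
```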

Using Named Groups

Named groups make the regex more readable and allow you to access groups by name.



import re

pattern = r'(?P<number>\d+)\s(?P<fruit>apples|oranges)'
text = 'There are 12 apples and 24 oranges.'
matches = re.finditer(pattern, text)

for match in matches:
    print(match.group('number'), match.group('fruit'))

Explanation: Named groups (?P<number>\d+) and (?P<fruit>apples|oranges) are used to capture digits and fruit names. The re.finditer() function returns an iterator yielding match objects. Each match object allows accessing the groups by their names, printing 12 apples and 24 oranges.
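
Named groups also combine nicely with dictionaries. This sketch uses groupdict() on one match and then builds a small lookup table from all matches; the variable name inventory is only illustrative.

```python
import re

pattern = r'(?P<number>\d+)\s(?P<fruit>apples|oranges)'
text = 'There are 12 apples and 24 oranges.'

# groupdict() returns all named groups of one match as a dict.
first = re.search(pattern, text).groupdict()
print(first)  # {'number': '12', 'fruit': 'apples'}

# Build a fruit -> count mapping from every match.
inventory = {
    m.group('fruit'): int(m.group('number'))
    for m in re.finditer(pattern, text)
}
print(inventory)  # {'apples': 12, 'oranges': 24}
```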

Advanced Regular Expressions

Lookahead and Lookbehind

Lookaheads and lookbehinds are zero-width assertions that allow you to match a pattern only if it's followed or preceded by another pattern.
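
"Zero-width" means the assertion checks the surrounding text without including it in the match. A minimal sketch comparing a plain pattern with a lookahead version:

```python
import re

text = '12 apples'

plain = re.search(r'\d+ apples', text)      # ' apples' is part of the match
ahead = re.search(r'\d+(?= apples)', text)  # ' apples' is only checked

print(plain.group(), plain.span())  # 12 apples (0, 9)
print(ahead.group(), ahead.span())  # 12 (0, 2)
```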

Positive Lookahead

Matches if the specified pattern is followed by another pattern.



import re

pattern = r'\d+(?= apples)'
text = 'There are 12 apples and 24 oranges.'
matches = re.findall(pattern, text)

print(matches)  # Output: ['12']

Explanation: The pattern \d+(?= apples) matches digits only if they are followed by " apples". The re.findall() function returns ['12'], as only "12" is followed by " apples".

Negative Lookahead

Matches if the specified pattern is not followed by another pattern.



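import re

pattern = r'\b\d+\b(?! apples)'
text = 'There are 12 apples and 24 oranges.'
matches = re.findall(pattern, text)

print(matches)  # Output: ['24']

Explanation: The pattern \b\d+\b(?! apples) matches a whole number only if it is not followed by " apples". The \b word boundaries matter here: without them, \d+ can backtrack and match just the "1" of "12" (which is followed by "2", not " apples"), so \d+(?! apples) on its own returns ['1', '24']. With the boundaries, re.findall() returns ['24'].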
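
Negative lookahead combined with a greedy quantifier is a common source of surprises, so it is worth running both forms yourself. The sketch below compares the pattern with and without \b word boundaries on the same text.

```python
import re

text = 'There are 12 apples and 24 oranges.'

# Without boundaries, \d+ backtracks from '12' to '1', and '1' is
# followed by '2' rather than ' apples', so the lookahead passes.
unanchored = re.findall(r'\d+(?! apples)', text)

# With boundaries, only complete numbers are considered.
anchored = re.findall(r'\b\d+\b(?! apples)', text)

print(unanchored)  # ['1', '24']
print(anchored)    # ['24']
```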

Positive Lookbehind

Matches if the specified pattern is preceded by another pattern.



import re

pattern = r'(?<=There are )\d+'
text = 'There are 12 apples and 24 oranges.'
matches = re.findall(pattern, text)

print(matches)  # Output: ['12']

Explanation: The pattern (?<=There are )\d+ matches digits only if they are preceded by "There are ". The re.findall() function returns ['12'], as only "12" is preceded by "There are ".
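
One restriction to keep in mind: in Python's ‘re’ module, the pattern inside a lookbehind must match text of a fixed length. The sketch below tries a variable-width lookbehind (the pattern is made up for the demonstration) and catches the resulting error, then shows a fixed-width version that works.

```python
import re

try:
    re.compile(r'(?<=are +)\d+')  # ' +' can match one or more spaces
    accepted = True
except re.error as exc:
    accepted = False
    print('Rejected:', exc)

# A fixed-width lookbehind compiles and works as expected.
fixed = re.findall(r'(?<=are )\d+', 'There are 12 apples')

print(accepted)  # False
print(fixed)     # ['12']
```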

Negative Lookbehind

Matches if the specified pattern is not preceded by another pattern.



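import re

pattern = r'(?<!There are )\b\d+'
text = 'There are 12 apples and 24 oranges.'
matches = re.findall(pattern, text)

print(matches)  # Output: ['24']

Explanation: The pattern (?<!There are )\b\d+ matches a number only if it is not preceded by "There are ". The \b keeps a match from starting in the middle of a number: without it, the regex engine skips "12" but then matches "2", which is preceded by "1" rather than "There are ", so (?<!There are )\d+ on its own returns ['2', '24']. With the boundary, re.findall() returns ['24'].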

Practical Examples

Email Validation


import re

pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'
email = 'example@example.com'

if re.match(pattern, email):
    print("Valid email address.")
else:
    print("Invalid email address.")

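Explanation: The pattern ^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$ performs a basic format check on an email address. It requires one or more letters, digits, or the characters _ . + - before the "@" symbol, followed by a domain name, a dot, and a suffix made of letters, digits, hyphens, or dots. It does not verify that the domain or top-level domain actually exists, and it rejects some addresses that are technically valid.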
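
To get a feel for what this pattern accepts, you can run it against a handful of inputs. The addresses below are invented test cases. The sketch also shows one subtle point: $ matches just before a trailing newline, so re.fullmatch() is the stricter choice when the whole string must match.

```python
import re

pattern = r'^[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+$'

# Hypothetical test inputs.
candidates = [
    'example@example.com',
    'first.last@mail.example.org',
    'user@localhost',            # no dot in the domain part
    'no-at-sign.example.com',    # missing '@'
    "o'brien@example.com",       # apostrophe is not in the character class
]

results = [bool(re.match(pattern, c)) for c in candidates]
print(results)  # [True, True, False, False, False]

# A trailing newline slips past ^...$ with re.match().
with_newline = 'example@example.com\n'
loose = bool(re.match(pattern, with_newline))
strict = bool(re.fullmatch(pattern, with_newline))
print(loose, strict)  # True False
```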

Extracting Dates



import re

pattern = r'(\d{2})/(\d{2})/(\d{4})'
text = 'The event is on 25/12/2024.'
matches = re.findall(pattern, text)

print(matches)  # Output: [('25', '12', '2024')]

Explanation: The pattern (\d{2})/(\d{2})/(\d{4}) captures dates in the format DD/MM/YYYY. The re.findall() function returns a list of tuples with the day, month, and year: [('25', '12', '2024')].
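
The regex only checks the shape of the date, so something like 31/02/2024 would also be captured. One possible way to filter out impossible dates, sketched here with an invented sample sentence, is to pass each candidate to datetime.strptime() from the standard library.

```python
import re
from datetime import datetime

# Hypothetical text containing one real date and one impossible date.
text = 'The event is on 25/12/2024, not on 31/02/2024.'

valid, invalid = [], []
for day, month, year in re.findall(r'(\d{2})/(\d{2})/(\d{4})', text):
    candidate = f'{day}/{month}/{year}'
    try:
        parsed = datetime.strptime(candidate, '%d/%m/%Y')
        valid.append(parsed.date().isoformat())
    except ValueError:
        # strptime() raises ValueError for dates that do not exist.
        invalid.append(candidate)

print(valid)    # ['2024-12-25']
print(invalid)  # ['31/02/2024']
```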

Phone Number Formatting


import re

pattern = r'(\d{3})-(\d{3})-(\d{4})'
text = 'My phone number is 123-456-7890.'
formatted_number = re.sub(pattern, r'(\1) \2-\3', text)

print(formatted_number)  # Output: 'My phone number is (123) 456-7890.'

Explanation: The pattern (\d{3})-(\d{3})-(\d{4}) captures phone numbers in the format XXX-XXX-XXXX. The re.sub() function replaces this pattern with the formatted string (\1) \2-\3, resulting in "(123) 456-7890".
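
The same replacement can be written with named groups, which some readers find easier to follow than \1, \2, and \3. In this sketch the group names area, prefix, and line are arbitrary labels picked for the example; \g<name> refers to a named group in the replacement string.

```python
import re

pattern = r'(?P<area>\d{3})-(?P<prefix>\d{3})-(?P<line>\d{4})'
text = 'My phone number is 123-456-7890.'

formatted = re.sub(pattern, r'(\g<area>) \g<prefix>-\g<line>', text)

print(formatted)  # My phone number is (123) 456-7890.
```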

Conclusion

Regular expressions are a powerful tool for text processing and data extraction. With the ‘re’ module in Python, you can leverage the full potential of regex to perform complex string manipulations. Practice with different patterns and use cases to become proficient in using regular expressions.

External Links

Regular Expressions Documentation

Regular Expressions 101 (Interactive Tool)

For more tips, tutorials, and Python projects, stay tuned to PythonSage.

Happy coding!
