Creating a Resume Parser with Python and NLP

Table of Contents

  1. Introduction
  2. Prerequisites
  3. Setting Up
  4. Building the Resume Parser
  5. Conclusion

Introduction

In today’s digital age, analyzing resumes manually can be a time-consuming task. With the help of Python and Natural Language Processing (NLP), we can automate the process of resume parsing. In this tutorial, you will learn how to create a resume parser using Python and NLP. The parser will extract important information from a resume, such as contact details, education, skills, and work experience.

By the end of this tutorial, you will have a working resume parser that can save you valuable time and effort when handling large volumes of resumes.

Prerequisites

To follow along with this tutorial, you should have a basic understanding of Python programming and the fundamentals of Natural Language Processing (NLP). Familiarity with the following Python libraries will also be beneficial:

  • NLTK (Natural Language Toolkit)
  • spaCy (Industrial-strength NLP library)
  • pdfplumber (PDF parsing library)

You will also need to have Python and pip (Python package installer) installed on your machine.

Setting Up

  1. Create a new directory for your project by opening your preferred terminal or command prompt and executing the following command:
     mkdir resume_parser
    
  2. Navigate to the newly created directory:
     cd resume_parser
    
  3. Initialize a new virtual environment to keep dependencies isolated:
     python -m venv venv
    
  4. Activate the virtual environment:

     On Windows:
     venv\Scripts\activate

     On macOS/Linux:
     source venv/bin/activate

  5. Install the required libraries:
     pip install nltk spacy pdfplumber
    
  6. Download the spaCy English language model by executing the following command:
     python -m spacy download en_core_web_sm
    

With the environment set up, we can now proceed to build the resume parser.

Building the Resume Parser

Step 1: Importing Libraries

We will be using several libraries in this tutorial. Import them by adding the following statements at the top of your Python script:

```python
import pdfplumber
import nltk
import spacy

from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from spacy.matcher import Matcher
```

Step 2: Loading the Resume

Before we can extract information from a resume, we need to load the document using the pdfplumber library. Here's an example of how to load a PDF resume named "resume.pdf" located in the same directory as your Python script:

```python
resume_text = ""

with pdfplumber.open("resume.pdf") as pdf:
    for page in pdf.pages:
        # extract_text() returns None for pages with no extractable text
        resume_text += page.extract_text() or ""
```

Step 3: Preprocessing the Resume

To prepare the resume for information extraction, we need to preprocess the text. This involves converting the text to lowercase, tokenizing it into individual words, and removing stopwords (common words that carry little meaning). Here's how you can preprocess the resume text using NLTK:

```python
nltk.download("punkt")      # newer NLTK versions may also require "punkt_tab"
nltk.download("stopwords")

# Convert text to lowercase
resume_text = resume_text.lower()

# Tokenize the text into individual words
tokens = word_tokenize(resume_text)

# Remove stopwords and punctuation from the tokenized list
stop_words = set(stopwords.words("english"))
filtered_tokens = [word for word in tokens if word.isalnum() and word not in stop_words]
```

Step 4: Extracting Information

Now that the resume text is preprocessed, we can use spaCy's rule-based matching to extract information such as email addresses, and its named entity recognition (NER) for names, educational institutions, and work experience. Here's an example of how to extract email addresses from the resume:

```python
nlp = spacy.load("en_core_web_sm")
doc = nlp(resume_text)

# Define a pattern to match email addresses
email_pattern = [{"LIKE_EMAIL": True}]

matcher = Matcher(nlp.vocab)
matcher.add("Email", [email_pattern])

# Find all email addresses in the resume
matches = matcher(doc)
email_addresses = []

for match_id, start, end in matches:
    email_addresses.append(doc[start:end].text)
```

You can create similar patterns and matchers to extract other types of information like names and phone numbers.
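For phone numbers, a plain regular expression is often simpler than a token pattern. Here is a minimal sketch assuming simple US-style formats; the sample string and the pattern itself are illustrative, not exhaustive:

```python
import re

# Hypothetical snippet standing in for resume_text
sample_text = "reach me at 555-123-4567 or john.doe@example.com"

# Three digit groups separated by -, ., or a space; real resumes need
# broader patterns (country codes, parentheses, extensions)
phone_regex = re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b")
phone_numbers = phone_regex.findall(sample_text)
print(phone_numbers)  # ['555-123-4567']
```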

Step 5: Displaying the Results

Finally, we can display the extracted information to the user. Here's an example of how to print the extracted email addresses:

```python
print("Extracted Email Addresses:")

for email in email_addresses:
    print(email)
```

Feel free to modify the code to display other extracted information according to your needs.
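The introduction also listed skills as a field worth extracting. One common approach, sketched below, is spaCy's PhraseMatcher run against a curated keyword list; the skills list and sample text here are hypothetical, and a blank English pipeline is used so no model download is required:

```python
import spacy
from spacy.matcher import PhraseMatcher

nlp = spacy.blank("en")

# Hypothetical skill keywords; in practice, load these from a curated list
SKILLS = ["python", "machine learning", "sql", "project management"]

skill_matcher = PhraseMatcher(nlp.vocab)
skill_matcher.add("SKILL", [nlp.make_doc(skill) for skill in SKILLS])

# Stands in for the lowercased resume_text from the earlier steps
doc = nlp("experienced in python and sql with a background in machine learning")

found_skills = sorted({doc[start:end].text for _, start, end in skill_matcher(doc)})
print("Extracted Skills:", found_skills)  # ['machine learning', 'python', 'sql']
```

Because the resume text was lowercased during preprocessing, the keyword list should be lowercase too; for case-insensitive matching on raw text, PhraseMatcher also accepts `attr="LOWER"`.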

Conclusion

In this tutorial, you learned how to create a resume parser using Python and NLP. By leveraging libraries such as pdfplumber, NLTK, and spaCy, you can automate the process of extracting important information from resumes. You can further enhance the parser by extending the patterns and matchers to extract additional types of information.

Automating the resume parsing process can save a significant amount of time and effort when handling large volumes of resumes.