CV Parsing & Extraction

Overview

The CV Parsing & Extraction system automates reading uploaded resumes and identifying candidate strengths. Using asynchronous background processing, it extracts raw text from each PDF file and matches that text against a predefined library of job-related keywords, giving administrators a "Match Score" for every application.

The Parsing Workflow

The extraction process is triggered automatically upon form submission. The workflow follows these steps:

Storage: The applicant's PDF is uploaded to the Supabase cvs storage bucket.
Trigger: The frontend invokes the parse-cv Edge Function with the file path and applicant ID.
Extraction: The Edge Function uses pdfjs-dist to read the text content of each page in the PDF and compile it into raw text.
Analysis: The system scans the text for specific job-related keywords.
Persistence: The extracted text, matched keywords, and final match score are saved back to the applicants table in the database.
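
The steps above can be sketched as a single pipeline function. This is a minimal illustration, not the actual Edge Function source: the names `CvPipelineDeps`, `ParseResult`, and `runCvPipeline` are hypothetical, and each step is passed in as a dependency so the ordering is visible without committing to a specific storage or database call.

```typescript
// Hypothetical shape of what gets saved back to the applicants table.
interface ParseResult {
  extractedText: string;
  matchedKeywords: string[];
  matchScore: number;
}

// Hypothetical dependencies, one per workflow step.
interface CvPipelineDeps {
  // Fetch the stored PDF bytes from the 'cvs' bucket.
  downloadCv(cvFilePath: string): Promise<Uint8Array>;
  // Extraction: turn PDF bytes into raw text.
  extractText(pdf: Uint8Array): Promise<string>;
  // Analysis: scan the text for keywords and compute a score.
  analyze(text: string): { matchedKeywords: string[]; matchScore: number };
  // Persistence: write the result to the applicant's row.
  saveResult(applicantId: string, result: ParseResult): Promise<void>;
}

async function runCvPipeline(
  deps: CvPipelineDeps,
  applicantId: string,
  cvFilePath: string,
): Promise<ParseResult> {
  const pdfBytes = await deps.downloadCv(cvFilePath);
  const extractedText = await deps.extractText(pdfBytes);
  const { matchedKeywords, matchScore } = deps.analyze(extractedText);
  const result: ParseResult = { extractedText, matchedKeywords, matchScore };
  await deps.saveResult(applicantId, result);
  return result;
}
```

Injecting the steps this way also makes the flow easy to exercise with in-memory fakes before wiring it to real storage and database calls.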

Edge Function: `parse-cv`

The core logic resides in a Supabase Edge Function. This allows the heavy lifting of PDF processing to happen off the main browser thread, ensuring a smooth user experience for the applicant.
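
As a rough sketch of the text-extraction part, the function below walks a PDF page by page and joins the text items it finds. The interfaces are hypothetical stand-ins that cover only the members used here (`numPages`, `getPage`, `getTextContent`, and items carrying a `str` field), which mirror the shape of a pdfjs-dist document; the actual Edge Function may structure this differently.

```typescript
// Minimal structural stand-ins for the parts of a pdfjs-dist document used below.
interface TextContentLike {
  items: Array<{ str?: string }>;
}
interface PdfPageLike {
  getTextContent(): Promise<TextContentLike>;
}
interface PdfDocumentLike {
  numPages: number;
  getPage(pageNumber: number): Promise<PdfPageLike>;
}

// Hypothetical helper: collect the text of every page into one string.
async function extractRawText(pdf: PdfDocumentLike): Promise<string> {
  const pages: string[] = [];
  // Page numbers start at 1, not 0.
  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const content = await page.getTextContent();
    const pageText = content.items
      .map((item) => item.str ?? "") // some items carry no text
      .filter((s) => s.length > 0)
      .join(" ");
    pages.push(pageText);
  }
  // One line per page keeps page boundaries visible in the stored text.
  return pages.join("\n").trim();
}
```

With pdfjs-dist, the document object would typically come from `getDocument({ data }).promise`, where `data` holds the downloaded PDF bytes; how the library is imported inside the Edge Function depends on the runtime setup and is not shown here.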

API Interface

To manually trigger or integrate the parser, use the following interface:

Endpoint: parse-cv
Method: POST

Request Body:

{
  applicantId: string;   // The UUID of the applicant in the database
  cvFilePath: string;    // The path to the file in the 'cvs' storage bucket
  customKeywords?: string[]; // Optional: additional keywords to look for
}
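
Because the function receives this body as untyped JSON, it can help to validate it before doing any work. The type guard below is a minimal sketch of such a check; the names `ParseCvRequest` and `isParseCvRequest` are hypothetical and not part of the documented interface.

```typescript
// Hypothetical named version of the request body shown above.
interface ParseCvRequest {
  applicantId: string;
  cvFilePath: string;
  customKeywords?: string[];
}

// Returns true only when the value has the documented shape.
function isParseCvRequest(value: unknown): value is ParseCvRequest {
  if (typeof value !== "object" || value === null) return false;
  const body = value as Record<string, unknown>;
  if (typeof body.applicantId !== "string" || body.applicantId.length === 0) return false;
  if (typeof body.cvFilePath !== "string" || body.cvFilePath.length === 0) return false;
  // customKeywords is optional, but must be a string array when present.
  if (body.customKeywords !== undefined) {
    if (!Array.isArray(body.customKeywords)) return false;
    if (!body.customKeywords.every((k) => typeof k === "string")) return false;
  }
  return true;
}
```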

Example Usage:

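// Assumes `supabase` is an already-initialized Supabase client.
const { data, error } = await supabase.functions.invoke("parse-cv", {
  body: {
    applicantId: "123-abc", // placeholder; pass the applicant's real UUID
    cvFilePath: "171589200-resume.pdf", // path inside the 'cvs' bucket
  },
});

if (error) {
  console.error("parse-cv failed:", error.message);
}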

Keyword Matching & Scoring

The system evaluates resumes based on a predefined library of keywords relevant to Japanese language proficiency and technical roles.

Matching Logic

Word Boundary Detection: The parser uses regular expressions to match keywords as whole words only. For example, the keyword "Java" does not match inside "JavaScript"; "JavaScript" is counted only if it is itself a keyword.
Case Insensitivity: Matches are found regardless of how the candidate capitalized the text.
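Score Calculation: Each unique matched keyword is worth 10 points, and the total is capped at 100, which is displayed as a percentage. A resume matching 4 keywords scores 40%; one matching 10 or more scores 100%.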
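
The three rules above can be illustrated with a short sketch. The function names are hypothetical, and one detail is an assumption rather than the documented behavior: instead of the plain `\b` word boundary, this version uses lookarounds that forbid an adjacent letter or digit, because `\b` is unreliable for keywords that begin or end with punctuation.

```typescript
const POINTS_PER_KEYWORD = 10;
const MAX_SCORE = 100;

// Escape characters that have special meaning in a regular expression.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Return each keyword (once) that appears in the text as a whole word.
function findMatchedKeywords(text: string, keywords: string[]): string[] {
  const matched: string[] = [];
  const seen = new Set<string>();
  for (const keyword of keywords) {
    const key = keyword.toLowerCase();
    if (seen.has(key)) continue; // count each keyword only once
    seen.add(key);
    // No letter or digit may touch the keyword on either side; "i" = case-insensitive.
    const pattern = new RegExp(
      `(?<![A-Za-z0-9])${escapeRegExp(keyword)}(?![A-Za-z0-9])`,
      "i",
    );
    if (pattern.test(text)) matched.push(keyword);
  }
  return matched;
}

// 10 points per unique match, capped at 100.
function computeMatchScore(matchedCount: number): number {
  return Math.min(matchedCount * POINTS_PER_KEYWORD, MAX_SCORE);
}

const sample = "Senior JavaScript and node.js developer, JLPT N2, Docker.";
const found = findMatchedKeywords(sample, ["Java", "JavaScript", "Node.js", "N2", "Docker", "Go"]);
console.log(found, computeMatchScore(found.length));
// prints [ 'JavaScript', 'Node.js', 'N2', 'Docker' ] 40
```

"Java" is skipped because it is immediately followed by a letter in "JavaScript", while "node.js" matches "Node.js" despite the different capitalization.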

Supported Categories

The analyzer currently looks for:

Japanese Proficiency: JLPT levels (N1, N2, N3, N4, N5) and language skills.
Tech Stack: Programming languages (Python, JavaScript, Go), frameworks and runtimes (React, Node.js), and databases.
Infrastructure: Cloud providers (AWS, Azure, GCP) and DevOps tools (Docker, Kubernetes).
Soft Skills: Leadership, communication, and project management.
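
One way to organize these categories is a simple map from category name to keyword list, merged with any `customKeywords` sent in the request. The sketch below is illustrative only: the constant and function names are hypothetical, and the lists contain just the keywords named above, not the full library.

```typescript
// Hypothetical keyword library; only keywords named in this document are listed.
const KEYWORD_LIBRARY: Record<string, string[]> = {
  japaneseProficiency: ["N1", "N2", "N3", "N4", "N5"],
  techStack: ["Python", "JavaScript", "Go", "React", "Node.js"],
  infrastructure: ["AWS", "Azure", "GCP", "Docker", "Kubernetes"],
  softSkills: ["Leadership", "Communication", "Project Management"],
};

// Flatten all categories, append custom keywords, and drop blanks and
// case-insensitive duplicates while preserving the original order.
function buildKeywordList(
  library: Record<string, string[]>,
  customKeywords: string[] = [],
): string[] {
  let all: string[] = [];
  for (const category of Object.keys(library)) {
    all = all.concat(library[category]);
  }
  all = all.concat(customKeywords);

  const seen = new Set<string>();
  const result: string[] = [];
  for (const raw of all) {
    const keyword = raw.trim();
    const key = keyword.toLowerCase();
    if (keyword.length === 0 || seen.has(key)) continue;
    seen.add(key);
    result.push(keyword);
  }
  return result;
}
```

Deduplicating here means a custom keyword that repeats a built-in one cannot earn points twice.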

Administrator View

Once parsing is complete, the results are available on the Applicant Detail page within the Admin Module.

Match Score: Displayed as a visual percentage to help prioritize candidates.
Matched Keywords: A list of tags showing exactly which required skills were found in the resume.
Extracted Text: The raw text output from the PDF is stored and viewable, allowing administrators to search through the resume content without downloading the file.
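
To illustrate how these stored fields support prioritizing and searching, here is a small sketch that ranks applicants by score and filters them by a phrase in the extracted text. The `ApplicantSummary` shape and both function names are hypothetical; the actual column names in the applicants table may differ.

```typescript
// Hypothetical view of the parsed fields stored for each applicant.
interface ApplicantSummary {
  id: string;
  matchScore: number;
  matchedKeywords: string[];
  extractedText: string;
}

// Highest match score first; the input array is left unchanged.
function rankByMatchScore(applicants: ApplicantSummary[]): ApplicantSummary[] {
  return applicants.slice().sort((a, b) => b.matchScore - a.matchScore);
}

// Case-insensitive substring search over the stored resume text.
function searchExtractedText(
  applicants: ApplicantSummary[],
  query: string,
): ApplicantSummary[] {
  const needle = query.trim().toLowerCase();
  if (needle === "") return [];
  return applicants.filter(
    (a) => a.extractedText.toLowerCase().indexOf(needle) >= 0,
  );
}
```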