Semantic keyword clustering in Python groups keywords by meaning using NLP embeddings rather than shared characters or text overlap. Libraries such as spaCy and BERT-based embedding models let marketers analyze large keyword sets efficiently. The 4 main methods are TF-IDF with DBSCAN, Sentence Transformers with K-Means, Sentence Transformers with HDBSCAN, and SERP-based clustering.
What Is Semantic Keyword Clustering in Python?
Google Search documentation covers the official guidance in "Creating helpful, reliable, people-first content."
Semantic keyword clustering is the process of grouping keywords that share the same meaning, search intent, or topic into clusters using machine learning algorithms. Python handles this by converting keywords into numerical vectors called embeddings, then applying a clustering algorithm to group similar vectors.
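As a minimal sketch of what "similar vectors" means, the snippet below compares toy vectors with cosine similarity. The three-dimensional vectors are invented for illustration only; a real embedding model produces hundreds of dimensions, and the function name `cosine_similarity` is a hypothetical helper rather than part of any library.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", made up for this example.
toy_embeddings = {
    "apple pie recipe": [0.9, 0.1, 0.0],
    "how to bake apple pie": [0.8, 0.2, 0.1],
    "apple store near me": [0.1, 0.9, 0.2],
}

sim_related = cosine_similarity(
    toy_embeddings["apple pie recipe"], toy_embeddings["how to bake apple pie"]
)
sim_unrelated = cosine_similarity(
    toy_embeddings["apple pie recipe"], toy_embeddings["apple store near me"]
)
print(f"related: {sim_related:.2f}, unrelated: {sim_unrelated:.2f}")
```

A clustering algorithm works on the same principle at scale: keywords whose vectors point in nearly the same direction end up in the same group.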
Traditional NLP libraries cluster based on keyword frequency and similarity. The results make sense mathematically but often fail to capture meaningful relationships between terms. For instance, "apple pie recipe" and "apple store" might be grouped together because they share the word "apple," while missing the link between "apple pie recipe" and "baking."
How Is Semantic Clustering Different from Text-Based Grouping?
Text-based grouping matches keywords by shared characters or n-grams. Semantic clustering matches keywords by contextual meaning using transformer models. KeyBERT, for example, uses BERT's contextual embeddings to identify the words or phrases most relevant to a piece of content, unlike traditional keyword extraction methods that rely on statistical or linguistic approaches.
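To make the text-based side concrete, here is a small sketch of character n-gram matching using Jaccard overlap. The helper names `char_ngrams` and `jaccard` and the choice of trigrams are assumptions for illustration, not a method taken from any specific tool.

```python
def char_ngrams(text, n=3):
    """Set of overlapping character n-grams in a lowercased string."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b, n=3):
    """Share of character n-grams that two strings have in common (0.0 to 1.0)."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a and not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)

# Shared word, different intent: the overlap is non-zero.
overlap_same_word = jaccard("apple pie recipe", "apple store")
# Related topic, no shared characters: the overlap is zero.
overlap_same_topic = jaccard("apple pie recipe", "baking")
print(overlap_same_word, overlap_same_topic)
```

Text overlap links the two "apple" keywords and misses the baking connection entirely, which is the gap semantic embeddings are meant to close.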
What Python Libraries Are Needed for Semantic Keyword Clustering?
Running semantic keyword clustering requires Python 3.9 or later, along with libraries such as pandas, numpy, sentence-transformers, bertopic, and openpyxl. Optional tools like spaCy or NLTK can handle preprocessing tasks.
Library | Function | Install Command
sentence-transformers | Generates semantic embeddings from keywords | pip install sentence-transformers
scikit-learn | Provides K-Means, DBSCAN, and TF-IDF | pip install scikit-learn
hdbscan | Hierarchical density-based clustering | pip install hdbscan
pandas | Handles keyword CSV input and output | pip install pandas
numpy | Numerical operations on embedding vectors | pip install numpy
bertopic | Topic modeling combined with clustering | pip install bertopic
keybert | BERT-based keyword extraction | pip install keybert
Which Python Version Does Semantic Keyword Clustering Require?
Python 3.9 or later is required for semantic keyword clustering. It is available for Mac, Windows, and Linux from the official Python site. Google Colab provides a free browser-based environment for running clustering scripts without local setup.
What Are the 4 Methods of Semantic Keyword Clustering in Python?
Method 1: TF-IDF Vectorization with DBSCAN
TF-IDF with DBSCAN is the fastest and simplest method. TfidfVectorizer builds one feature vector per query. Clustering algorithms work with numbers, so every keyword is transformed into a vector with one dimension per term found in the input keyword set, weighted by TF-IDF. TfidfVectorizer does not stem by default, so stem the keywords beforehand if variants such as "recipe" and "recipes" should share a dimension.
DBSCAN does not require choosing the number of clusters (k) in advance, for example with the elbow method. It does need a distance threshold (eps) and a minimum number of keywords per cluster (min_samples). Keywords that belong to the same group can be concatenated with a pipe delimiter in the output file.
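```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import DBSCAN
import pandas as pd

# Load keywords from a CSV file that has a "keyword" column
df = pd.read_csv("keywords.csv")
keywords = df["keyword"].tolist()

# Turn each keyword into a sparse TF-IDF vector
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(keywords)

# Cluster by cosine distance; DBSCAN labels noise points as -1
model = DBSCAN(eps=0.5, min_samples=2, metric="cosine")
labels = model.fit_predict(X)

# Save the cluster ID next to each keyword
df["cluster"] = labels
df.to_csv("clustered_keywords.csv", index=False)
```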
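The pipe-delimited grouping mentioned above could be produced with a small helper like the one below. This is a sketch only: `join_clusters` is a hypothetical function name, and it assumes the keywords and labels are parallel sequences such as `keywords` and `labels` from the DBSCAN step.

```python
from collections import defaultdict

def join_clusters(keywords, labels, delimiter="|"):
    """Map each cluster label to its keywords joined by a delimiter."""
    groups = defaultdict(list)
    for keyword, label in zip(keywords, labels):
        # int() also accepts the NumPy integers that scikit-learn returns
        groups[int(label)].append(keyword)
    return {
        label: delimiter.join(members)
        for label, members in sorted(groups.items())
    }

# Toy labels for illustration; -1 is the label DBSCAN gives to noise points.
example = join_clusters(
    ["apple pie recipe", "easy apple pie", "apple store hours"],
    [0, 0, -1],
)
print(example)
```

Each value in the returned dictionary can be written as one row of the output file, with the noise group (label -1) kept separate for manual review.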
Method 2: Sentence Transformers with K-Means
Sentence Transformers models generate dense vector embeddings that capture semantic meaning. A compact model such as all-MiniLM-L6-v2 maps sentences and paragraphs to a 384-dimensional dense vector space and can be used for tasks like clustering or semantic search.
K-Means requires a predefined k value. Use the silhouette score method to find the optimal k before running the final clustering.
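A minimal sketch of this workflow follows, assuming the all-MiniLM-L6-v2 model and scikit-learn's KMeans and silhouette_score. The function names and the candidate range of k are hypothetical choices for illustration, not a prescribed setup.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def embed_keywords(keywords, model_name="all-MiniLM-L6-v2"):
    """Encode keywords into dense vectors (the model downloads on first use)."""
    from sentence_transformers import SentenceTransformer  # imported lazily
    return SentenceTransformer(model_name).encode(keywords)

def best_k_by_silhouette(embeddings, k_values, random_state=42):
    """Try each k and return (best_k, scores), scores mapping k to silhouette."""
    X = np.asarray(embeddings)
    scores = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
        scores[k] = float(silhouette_score(X, labels))
    best_k = max(scores, key=scores.get)
    return best_k, scores

def cluster_with_kmeans(embeddings, k, random_state=42):
    """Run the final K-Means clustering with the chosen k."""
    X = np.asarray(embeddings)
    return KMeans(n_clusters=k, n_init=10, random_state=random_state).fit_predict(X)
```

In practice, the output of `embed_keywords(keywords)` would be passed to `best_k_by_silhouette` with a range such as `range(2, 30)`, and the winning k to `cluster_with_kmeans`. Every candidate k must be at least 2 and smaller than the number of keywords, because the silhouette score is undefined outside that range.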
Method 3: Sentence Transformers with HDBSCAN (Recommended)

This method clusters Sentence Transformers embeddings with HDBSCAN. DBSCAN and HDBSCAN handle irregularly shaped clusters better than K-Means and automatically flag outliers as noise. Dimensionality reduction with PCA or UMAP can improve clustering performance by reducing noise in high-dimensional embeddings.
HDBSCAN assigns a label of -1 to noise points. These are keywords that do not fit any cluster and require manual review.
Method 4: SERP-Based Clustering Using Google Results
SERP-based clustering relies on Google's NLP as a black box instead of building custom models. If the same pages rank for different keywords, those keywords are likely to be semantically related. A graph is created using the relationship between keywords and ranking pages.
Scraping SERP results and following the connections between ranking pages uncovers clusters that reflect real-world search intent. This approach uses Python libraries including networkx for graph-based clustering and SQLite for storing SERP data.
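The graph idea can be sketched without any dependencies, as below. The input format (a dictionary mapping each keyword to its ranking URLs), the `min_shared` threshold, and the function name are all assumptions made for illustration. With networkx, the same grouping could be obtained by adding the edges to an `nx.Graph` and calling `nx.connected_components`.

```python
from itertools import combinations

def cluster_by_serp_overlap(serp_results, min_shared=3):
    """Group keywords whose ranking URLs overlap by at least `min_shared` pages."""
    keywords = list(serp_results)

    # Build an undirected graph: an edge links two keywords with enough shared URLs.
    neighbors = {kw: set() for kw in keywords}
    for a, b in combinations(keywords, 2):
        shared = set(serp_results[a]) & set(serp_results[b])
        if len(shared) >= min_shared:
            neighbors[a].add(b)
            neighbors[b].add(a)

    # Each connected component of the graph becomes one cluster.
    clusters, seen = [], set()
    for kw in keywords:
        if kw in seen:
            continue
        component, stack = [], [kw]
        seen.add(kw)
        while stack:
            current = stack.pop()
            component.append(current)
            for nxt in neighbors[current] - seen:
                seen.add(nxt)
                stack.append(nxt)
        clusters.append(sorted(component))
    return clusters

# Hypothetical SERP data; the URLs are placeholders, not real results.
toy_serps = {
    "buy running shoes": ["a.example", "b.example", "c.example"],
    "best running shoes": ["a.example", "b.example", "d.example"],
    "apple pie recipe": ["x.example", "y.example", "z.example"],
}
serp_clusters = cluster_by_serp_overlap(toy_serps, min_shared=2)
print(serp_clusters)
```

The threshold controls how strict the grouping is: a higher `min_shared` yields smaller, tighter clusters, while a lower value merges more keywords through chains of partial overlap.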
This method requires a Google Custom Search API key. The free quota is 100 requests per day. The paid plan costs $5 per 1,000 requests.
How Do You Choose Between K-Means, DBSCAN, and HDBSCAN?
Algorithm | Cluster Count Required | Handles Noise | Best For
K-Means | Yes (define k) | No | Balanced clusters of similar size
DBSCAN | No (automatic) | Yes | Irregular cluster shapes
HDBSCAN | No (automatic) | Yes | Large keyword sets with varied densities
Start with simple implementations using Sentence-BERT and K-Means clustering. Gradually incorporate advanced techniques like DBSCAN and hierarchical clustering as requirements grow.
HDBSCAN is the recommended method for keyword sets above 1,000 terms. K-Means is suitable for smaller lists where the approximate number of topic clusters is known in advance.
How Do Semantic Keyword Clusters Improve SEO Content Strategy?
As of 2025, Google's Knowledge Graph reportedly covers around 800 billion facts about 8 billion entities, which makes NLP techniques a practical way for SEO professionals to map content to these relationships.
Semantic keyword clusters improve SEO content strategy in 3 ways:
- Each cluster maps to 1 content page, reducing keyword cannibalization across the site
- Cluster labels reveal topic gaps where no page currently exists on the site
- Clusters align content structure with how Google groups search intent, improving topical authority
A clustering script can handle large datasets by dividing the data into blocks, which keeps memory usage manageable. The same pipeline supports keyword clustering for SEO analysis, semantic topic modeling, intent analysis in multilingual datasets, and preprocessing for machine learning models.
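Block-wise processing can be as simple as the generator below. This is a sketch of the general idea rather than the script referred to above, and `iter_blocks` is a hypothetical helper name.

```python
def iter_blocks(items, block_size):
    """Yield consecutive slices of `items`, each with at most `block_size` elements."""
    if block_size < 1:
        raise ValueError("block_size must be at least 1")
    for start in range(0, len(items), block_size):
        yield items[start:start + block_size]

# Process a keyword list block by block instead of all at once.
keywords = [f"keyword {i}" for i in range(10)]
block_lengths = [len(block) for block in iter_blocks(keywords, 4)]
print(block_lengths)
```

Each block can be embedded and written to disk before the next one is loaded, so only one block of keywords and vectors is held in memory at a time.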
How Do You Export Keyword Clusters for Content Planning?
Export cluster results to CSV or Excel for content team use. Add 3 columns to the output file:
- cluster: the numeric cluster ID assigned by the algorithm
- cluster_label: a descriptive name for the cluster topic (assigned manually or via BERTopic)
- content_url: the target URL for the page that will cover that cluster
This structure maps every keyword to a specific page, creates a complete content brief input, and prevents multiple pages from targeting the same search intent.
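One possible way to build that file with the standard csv module is sketched below. The function names, the extra `keyword` column, and the example labels and URLs are assumptions for illustration. The cluster names and URLs would come from manual review or a tool such as BERTopic.

```python
import csv

COLUMNS = ["keyword", "cluster", "cluster_label", "content_url"]

def build_plan_rows(keywords, labels, cluster_names, cluster_urls):
    """Combine keywords and cluster IDs with per-cluster names and target URLs."""
    rows = []
    for keyword, label in zip(keywords, labels):
        label = int(label)
        rows.append({
            "keyword": keyword,
            "cluster": label,
            # Clusters without a name or URL yet are left blank for the content team.
            "cluster_label": cluster_names.get(label, ""),
            "content_url": cluster_urls.get(label, ""),
        })
    return rows

def write_plan_csv(rows, file_obj):
    """Write the plan rows, with a header line, to an open text file object."""
    writer = csv.DictWriter(file_obj, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)

# Hypothetical example data.
plan_rows = build_plan_rows(
    ["apple pie recipe", "easy apple pie", "apple store hours"],
    [0, 0, 1],
    cluster_names={0: "Apple pie recipes"},
    cluster_urls={0: "/recipes/apple-pie"},
)
```

To save the file, open it with `open("content_plan.csv", "w", newline="", encoding="utf-8")` and pass the handle to `write_plan_csv`. Blank `content_url` cells then point to clusters that still need a page.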

Waleed Qamar holds a BSc in Computer Science from Purdue University and has spent the years since turning that technical foundation into something the curriculum never covered: figuring out why websites rank, why they fall, and why most businesses never find out until it is too late.
Pakistan-born and based between the United States and South Asia, he has managed search visibility for e-commerce stores, local service businesses, and SaaS startups across two continents. He started in SEO when guest posting still worked, survived the Penguin update, and has rebuilt client sites from scratch after algorithm hits more than once.
He has watched good businesses get sold packages that looked like progress and delivered nothing lasting. He has also seen the right approach quietly double a site’s traffic without a single press release about it.
His writing on SEO By Highsoftware99 covers Google algorithm updates, autocomplete optimization, semantic SEO structure, and the widening gap between what agencies promise and what Google actually rewards in 2026.
He knows what a traffic cliff looks like in Search Console on the morning you discover it.

