Enjoying what you find in this repository? Your star would be greatly appreciated!
Weaviate Lexplorer is a 100% open-source tool for getting deep insights from Lex Fridman's podcasts. It runs hybrid search on a Weaviate vector database over podcast transcriptions split into chunks, so you can jump straight to the key discussions. All you need to explore the podcasts is a search input and a slider.*
See the app demonstration.
*Inspired by QuoteFinder.
Run the following command in the terminal to clone this repository:
git clone https://github.com/rokbenko/weaviate-lexplorer.git
Run the following command in the terminal to create Docker containers using the docker-compose.yaml:
docker compose up -d
Important
Having Docker installed is a prerequisite. If you don't have Docker installed, install it.
Run the following command in the terminal to download the nomic-embed-text embedding model with Ollama:
ollama pull nomic-embed-text:v1.5
Important
Having Ollama installed is a prerequisite. If you don't have Ollama installed, install it.
There are two datasets inside the data directory:
- podcast_dataset.csv, downloaded from Kaggle
- podcast_links_dataset.csv, created by the repository author
There are five JavaScript files inside the root directory:
weaviate_create_podcast_collection.mjs
weaviate_create_podcast_links_collection.mjs
weaviate_delete_collection.mjs
weaviate_iterate_collection.mjs
weaviate_list_all_collections.mjs
Only two out of five JavaScript files are mandatory to run.
Run the weaviate_create_podcast_collection.mjs to create a Weaviate collection named Podcast from the podcast_dataset.csv:
node weaviate_create_podcast_collection.mjs
Note
Creating the Podcast collection takes a long time, possibly as long as a day, because the script needs to chunk all podcast transcriptions and insert them into Weaviate. Chunking is necessary for two reasons:
- Without chunking, the context window limit of the nomic-embed-text embedding model will be hit because Lex Fridman's podcasts are long.
- Without chunking, a single vector has to represent too much information and loses specificity. Consequently, the core functionality of the app (i.e., getting deep insights from Lex Fridman's podcasts) can't be achieved.
Read more about why chunking is important.
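To make the chunking idea concrete, here is a minimal sketch in plain JavaScript. It is not the repository's implementation: the dependency list suggests the actual script relies on @langchain/textsplitters, whereas this stand-in uses a simple fixed-size sliding window. The function name `chunkText` and the default `chunkSize` and `overlap` values are assumptions chosen for illustration.

```javascript
// Split a long transcript into overlapping chunks of at most `chunkSize` characters.
// NOTE: `chunkText`, `chunkSize`, and `overlap` are illustrative assumptions,
// not names or values taken from the repository.
function chunkText(text, chunkSize = 1000, overlap = 200) {
  if (overlap >= chunkSize) {
    throw new Error("overlap must be smaller than chunkSize");
  }
  const chunks = [];
  const step = chunkSize - overlap; // how far the window advances each time
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    // Stop once the current window has reached the end of the text.
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which tends to help retrieval. Each chunk would then be embedded and inserted into Weaviate as its own object.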
Run the weaviate_create_podcast_links_collection.mjs to create a Weaviate collection named Podcast_links from the podcast_links_dataset.csv:
node weaviate_create_podcast_links_collection.mjs
The other three JavaScript files are meant for manipulating Weaviate collections:
- weaviate_delete_collection.mjs can be used to delete a Weaviate collection.
- weaviate_iterate_collection.mjs can be used to iterate over an existing Weaviate collection.
- weaviate_list_all_collections.mjs can be used to list all existing Weaviate collections.
Important
Two of these three JavaScript files use the .env file. So, before you run weaviate_delete_collection.mjs or weaviate_iterate_collection.mjs, make sure you set up the .env file.
Your .env file should contain the following environment variable:
WEAVIATE_COLLECTION_NAME=xxxxx
For example, if you set WEAVIATE_COLLECTION_NAME=Podcast_test in the .env file, running the two JavaScript files will, respectively:
- Delete the collection named Podcast_test.
- Iterate items in the collection named Podcast_test.
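As a rough illustration of how a script might pick up this setting, the sketch below reads the variable from an environment-like object and fails early when it is missing. The helper name `getCollectionName` is hypothetical and does not come from the repository. In a real script, the .env file would typically be loaded into `process.env` first (for example with dotenv).

```javascript
// Resolve the target collection name from an environment-like object.
// NOTE: `getCollectionName` is a hypothetical helper for illustration only.
// In a script, dotenv would normally populate process.env from the .env file first.
function getCollectionName(env = process.env) {
  const name = (env.WEAVIATE_COLLECTION_NAME || "").trim();
  if (!name) {
    // Failing early avoids deleting or iterating an unintended collection.
    throw new Error("WEAVIATE_COLLECTION_NAME is not set; add it to the .env file");
  }
  return name;
}
```

Validating the name before any delete call is a cheap safeguard, since deleting a collection cannot be undone.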
Important
Additionally, before you run weaviate_delete_collection.mjs or weaviate_iterate_collection.mjs, make sure to install dependencies by running the following command in the terminal:
npm install dotenv csv-parse @langchain/textsplitters
Run the following command in the terminal to change the directory:
cd client
Run the following command to install all Next.js dependencies:
npm install
Run the following command to run the Next.js development server:
npm run dev
To view the app, navigate to http://localhost:3000.
Important
All three Docker containers (i.e., weaviate-1, ollama-1, and text-spellcheck-1) and Ollama should be running while using the app.
Weaviate Lexplorer uses hybrid search to surface the most relevant passages from Lex Fridman's podcast episodes. Here's a detailed breakdown of how it works:
- Data collection and preparation
- Podcast transcriptions: The core data for Weaviate Lexplorer is the transcriptions of Lex Fridman's podcast episodes. Each transcription is broken down into smaller, manageable chunks for more detailed analysis.
- Weaviate vector database: The chunks of transcriptions are stored in the Weaviate vector database, which enables efficient and advanced search capabilities.
- Hybrid search
- Keyword search: This traditional search method looks for exact matches of the search terms within the chunks of the podcast transcriptions.
- Vector search: This method leverages the Weaviate vector database to find semantically similar chunks to the search terms, even if the exact keywords are not present. It uses machine learning models to understand the context and meaning of the words.
- Alpha parameter: The alpha parameter allows the user to balance the importance of keyword search results versus vector search results. In Weaviate, an alpha of 0 means pure keyword search and an alpha of 1 means pure vector search. A slider control in the user interface lets users adjust this parameter to fine-tune their search.
- User interaction
- Search input: Users can type in their search terms into a simple input field.
- Alpha slider: Users can adjust a slider to set the alpha parameter, which determines the mix between keyword and vector search results.
- Search execution: When the search is executed (either by pressing enter or clicking a search button), the application sends a query to the Weaviate vector database, which processes both keyword and vector searches based on the provided alpha value.
- Results processing and display
- Relevance scoring: The results are scored based on their relevance, combining scores from both keyword and vector searches. The higher the score, the more relevant the chunk is to the user's query.
- Loading skeleton: While the search is being processed, loading skeleton cards are displayed to indicate that the application is working on retrieving the results.
- Displaying results: Once the search is complete, the results are displayed as a list of podcast episodes. Each episode is shown with its relevant transcription chunks highlighted, providing users with quick and easy access to the most relevant parts of the podcasts.
- Additional features
- Link integration: For each podcast episode, relevant links (e.g., YouTube links) are displayed. If a link is marked as "Removed" in the database, the application will show a message indicating that the podcast was removed.
- User experience: The application uses Material UI components to ensure a responsive and aesthetically pleasing user interface.
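The role of alpha in the hybrid search step can be sketched as a weighted sum of normalized scores. This is a simplified illustration, not Weaviate's exact internal fusion algorithm. It does follow Weaviate's convention that an alpha of 0 is pure keyword search and an alpha of 1 is pure vector search. The function names and the score maps are assumptions made up for this example.

```javascript
// Min-max normalize a map of id -> raw score into the range [0, 1].
function normalize(scores) {
  const values = Object.values(scores);
  if (values.length === 0) return {};
  const min = Math.min(...values);
  const max = Math.max(...values);
  const out = {};
  for (const [id, s] of Object.entries(scores)) {
    // If all scores are equal, treat every result as a full match.
    out[id] = max === min ? 1 : (s - min) / (max - min);
  }
  return out;
}

// Combine keyword and vector scores: alpha = 0 is keyword only, alpha = 1 is vector only.
// NOTE: a simplified sketch; Weaviate's own fusion logic differs in detail.
function fuseScores(keywordScores, vectorScores, alpha) {
  const kw = normalize(keywordScores);
  const vec = normalize(vectorScores);
  const ids = new Set([...Object.keys(kw), ...Object.keys(vec)]);
  const fused = [];
  for (const id of ids) {
    // A chunk missing from one result set contributes 0 for that side.
    const score = (1 - alpha) * (kw[id] ?? 0) + alpha * (vec[id] ?? 0);
    fused.push({ id, score });
  }
  // Highest combined score first.
  return fused.sort((a, b) => b.score - a.score);
}
```

For example, calling `fuseScores({ a: 10, b: 0 }, { a: 0.2, b: 0.9 }, 0)` ranks chunk `a` first because only the keyword score counts, while the same call with an alpha of 1 ranks chunk `b` first.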
By breaking down podcast transcriptions into chunks and leveraging hybrid search technology, Weaviate Lexplorer offers a powerful and user-friendly way to explore and discover insights from Lex Fridman's podcast episodes.
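Because the search returns chunks while the interface lists episodes, the results have to be grouped at some point. The sketch below shows one way that grouping could work. The field names `episodeTitle` and `score`, the function name `groupByEpisode`, and the choice to order episodes by their best chunk are all assumptions for illustration, since the source does not describe how the app does this.

```javascript
// Group scored chunks by episode and order episodes by their best-scoring chunk.
// NOTE: `episodeTitle`, `score`, and `groupByEpisode` are hypothetical names,
// not taken from the repository.
function groupByEpisode(chunks) {
  const episodes = new Map();
  for (const chunk of chunks) {
    if (!episodes.has(chunk.episodeTitle)) {
      episodes.set(chunk.episodeTitle, {
        title: chunk.episodeTitle,
        bestScore: chunk.score,
        chunks: [],
      });
    }
    const episode = episodes.get(chunk.episodeTitle);
    episode.chunks.push(chunk);
    episode.bestScore = Math.max(episode.bestScore, chunk.score);
  }
  // Episodes with the most relevant chunk come first.
  return [...episodes.values()].sort((a, b) => b.bestScore - a.bestScore);
}
```

Ordering by the best chunk favors episodes with one highly relevant passage. Summing or averaging chunk scores would instead favor episodes that discuss the topic at length.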
Weaviate Lexplorer works with the following tech stack:
Tech | Version |
---|---|
Docker Desktop | 4.26.1 |
Ollama | 0.1.48 |
Node.js | 21.2.0 |
CSV parse | 5.5.6 |
LangChain text splitters | 0.0.3 |
Weaviate JS/TS client | 3.0.8 |
Dotenv | 16.4.5 |
Next.js | 14.2.4 |
React | ^18 |
TypeScript | ^5 |
Tailwind CSS | ^3.4.4 |
SASS | ^1.77.6 |
Material UI Next.js | ^5.15.11 |
Material Lab | ^5.0.0-alpha.171 |
Material Icons | ^5.16.0 |
ESLint | ^8 |
- The Lex Fridman podcast transcript dataset is not up-to-date. The latest transcription is for podcast #325. Take this into account when using the app because Weaviate Lexplorer will only be able to search for podcast episodes up to #325.
- Weaviate Lexplorer is not deployed on Vercel because it works with a local Weaviate setup using Docker.
- The search feature is slower than it could be because every search also looks up links in the Podcast_links collection. Removing the link feature would make the search much faster, but one of the core features (i.e., simply clicking the link to start watching the YouTube podcast episode instead of manually searching for it on Lex Fridman's YouTube channel) would be lost. The best way to speed up the search would be to have the links in the Lex Fridman podcast transcript dataset itself. In that case, only the Podcast Weaviate collection would be needed, with the YouTube links added to it.
Contributions are welcome! Feel free to open issues or create pull requests for any improvements or bug fixes.
This project is open source and available under the MIT License.