This pipeline performs entity resolution on data collected from the North Dakota Business Search. It consists of a web crawler that pulls and parses the data and an entity resolution (ER) service that visualizes the relationships between entities.
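Entity resolution here means linking records that refer to the same real-world business despite spelling variations. As a minimal illustrative sketch (not this repo's actual matching logic; the names and threshold below are hypothetical), a similarity check using only Python's standard library might look like:

```python
from difflib import SequenceMatcher

def normalize(name: str) -> str:
    """Lowercase and strip punctuation noise so variants compare equal."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())

def same_entity(name_a: str, name_b: str, threshold: float = 0.85) -> bool:
    """Treat two business names as the same entity if similar enough."""
    ratio = SequenceMatcher(None, normalize(name_a), normalize(name_b)).ratio()
    return ratio >= threshold

# Two spellings of the same registered business match:
print(same_entity("Dakota Farms, LLC", "Dakota Farms LLC"))  # → True
```

Real ER pipelines typically add blocking and richer features (addresses, registered agents), but the core idea is the same: normalize, compare, and link above a threshold.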
Explore an interactive plot showcasing entity relationships.
Access the original data source from which the data for this entity resolution pipeline is crawled.
Find the crawled data used for the entity resolution process.
Make sure Docker Desktop is installed and running.
Configure the search parameters and output file path in docker-compose.yml and run the following. (Default values for the search parameters and output file path are set in services.py.)
docker-compose run web_crawler
Configure the input dataset path and the output file path for the plot and run the following. (Default values for these paths are set in services.py.)
docker-compose run er
docker-compose build view_er
docker-compose up view_er
Access the ER visualization in your browser at http://localhost:8000.
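The visualization plots relationships between resolved entities, for example businesses that share a registered agent. As a rough sketch of how such edges could be derived from crawled records (the field names below are assumptions, not the actual crawler schema), using only the standard library:

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical crawled records; the real schema may differ.
records = [
    {"business": "Dakota Farms LLC", "agent": "A. Smith"},
    {"business": "Prairie Holdings Inc", "agent": "A. Smith"},
    {"business": "Red River Trading LLC", "agent": "B. Jones"},
]

# Group businesses by shared agent, then emit one edge per pair in a group.
by_agent = defaultdict(list)
for rec in records:
    by_agent[rec["agent"]].append(rec["business"])

edges = [pair for group in by_agent.values()
         for pair in combinations(sorted(group), 2)]
print(edges)  # → [('Dakota Farms LLC', 'Prairie Holdings Inc')]
```

An edge list like this is what a plotting library then renders as the interactive relationship graph.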
docker-compose run format
pip install poetry
cd entity_resolution_pipeline
poetry install
poetry run er_pipeline run_crawler
poetry run er_pipeline run_er
poetry run er_pipeline view_er_in_browser
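The `er_pipeline` commands above are exposed as a Poetry script entry point. A minimal sketch of how such a dispatcher could be wired with `argparse` (the actual implementation in this repo may differ; the return strings are placeholders):

```python
import argparse

def run_crawler():
    # Placeholder for the crawler service.
    return "crawler finished"

def run_er():
    # Placeholder for the entity resolution service.
    return "er finished"

def view_er_in_browser():
    # Placeholder for serving the visualization.
    return "serving on http://localhost:8000"

COMMANDS = {
    "run_crawler": run_crawler,
    "run_er": run_er,
    "view_er_in_browser": view_er_in_browser,
}

def main(argv=None):
    parser = argparse.ArgumentParser(prog="er_pipeline")
    parser.add_argument("service", choices=COMMANDS)
    args = parser.parse_args(argv)
    return COMMANDS[args.service]()

if __name__ == "__main__":
    print(main())
```

A structure like this also gives each subcommand a `--help` flag for free via `argparse`, which is what the next section relies on.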
To configure custom parameters for any of the above services with Poetry, use the command below to view each service's configuration options.
Default values for these parameters are configured in services.py.
poetry run er_pipeline {service_name} --help
e.g. poetry run er_pipeline run_crawler --help
poetry run black .