linkrot scans PDFs for links written in plaintext, checks whether each link is active or returns an error code, and generates a report of its findings. It can also extract references (PDF, URL, DOI, arXiv) and metadata from a PDF.
New in v5.2.2: Retraction checking! linkrot now automatically checks DOIs against retraction databases to identify potentially retracted papers, helping ensure research integrity.
Check out our sister project, Rotting Research, for a web app implementation of this project.
Grab a copy of the code with pip:
pip install linkrot
For Debian/Ubuntu systems, you can build and install a .deb package:
# Install build dependencies
sudo apt-get install dpkg-dev debhelper dh-python python3-setuptools
# Build the package
python3 setup-deb-build.py
./build-deb.sh
# Install the packages
sudo dpkg -i ../python3-linkrot_*.deb ../linkrot_*.deb
sudo apt-get install -f  # Fix any dependency issues
See debian/README.md for detailed packaging instructions.
linkrot can be used to extract info from a PDF in two ways:
- Command-line/terminal tool: linkrot
- Python library: import linkrot
linkrot [pdf-file-or-url]
Run linkrot -h to see the help output:
linkrot -h
usage:
linkrot [-h] [-d OUTPUT_DIRECTORY] [-c] [-r] [-j] [-v] [-t] [-o OUTPUT_FILE] [--version] pdf
Extract metadata and references from a PDF, and optionally download all referenced PDFs.
pdf (Filename or URL of a PDF file)
-h, --help            (Show this help message and exit)  
-d OUTPUT_DIRECTORY,  --download-pdfs OUTPUT_DIRECTORY (Download all referenced PDFs into specified directory)  
-c, --check-links     (Check for broken links)  
-r, --check-retractions (Check DOIs for retracted papers)
-j, --json            (Output infos as JSON (instead of plain text))  
-v, --verbose         (Print all references (instead of only PDFs))  
-t, --text            (Only extract text (no metadata or references))  
-a, --archive         (Archive active links)  
-o OUTPUT_FILE,        --output-file OUTPUT_FILE (Output to specified file instead of console)  
--version             (Show program's version number and exit)  
For testing purposes, you can find PDF samples in this [shared MEGA folder](https://mega.nz/folder/uwBxVSzS#lpBtSz49E9dqHtmrQwp0Ig).
linkrot https://example.com/example.pdf -t
linkrot https://example.com/example.pdf -t -o pdf-text.txt
linkrot https://example.com/example.pdf -c
linkrot https://example.com/example.pdf -r
linkrot https://example.com/example.pdf -c -r
linkrot https://example.com/example.pdf -r -j
Import the library:
import linkrot
Create an instance of the linkrot class like so:
pdf = linkrot.linkrot("filename-or-url.pdf") #pdf is the instance of the linkrot class
The following functions can now be used to extract specific data from the PDF:
Arguments: None
Usage:
metadata = pdf.get_metadata() #pdf is the instance of the linkrot class
Return type: Dictionary <class 'dict'>
Information Provided: All metadata associated with the PDF, including hidden ("secret") metadata, such as creation date, creator, and title.
Arguments: None
Usage:
text = pdf.get_text() #pdf is the instance of the linkrot class
Return type: String <class 'str'>
Information Provided: The entire content of the PDF in string form.
Arguments:
reftype: The type of reference needed.
    values: 'pdf', 'url', 'doi', 'arxiv'.
    default: Provides all reference types.
sort: Whether the references should be sorted.
    values: True or False.
    default: Not sorted.
Usage:
references_list = pdf.get_references() #pdf is the instance of the linkrot class
Return type: Set <class 'set'> of <linkrot.backends.Reference object>
linkrot.backends.Reference object has 3 member variables:
- ref: actual URL/PDF/DOI/ARXIV
- reftype: type of reference
- page: page on which it was referenced
Information Provided: All references with their corresponding type and page number.
Arguments:
reftype: The type of reference needed.
    values: 'pdf', 'url', 'doi', 'arxiv'.
    default: Provides all reference types.
sort: Whether the references should be sorted.
    values: True or False.
    default: Not sorted.
Usage:
references_dict = pdf.get_references_as_dict() #pdf is the instance of the linkrot class
Return type: Dictionary <class 'dict'> with keys ‘pdf’, ‘url’, ‘doi’, ‘arxiv’ that each have a list <class 'list'> of refs of that type.
Information Provided: All references in their corresponding type list.
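To illustrate how a set of references maps onto a per-type dictionary, here is a minimal sketch. It does not import linkrot: `RefSketch` and `group_by_reftype` are hypothetical names that only mirror the three documented fields (`ref`, `reftype`, `page`). The real method may differ in ordering and in which keys it always includes.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class RefSketch:
    """Hypothetical stand-in mirroring the three documented Reference fields."""
    ref: str
    reftype: str
    page: int


def group_by_reftype(refs):
    """Group references into {reftype: sorted list of ref strings}."""
    grouped = defaultdict(list)
    for r in refs:
        grouped[r.reftype].append(r.ref)
    # Sort each list so the result is deterministic even for set input.
    return {reftype: sorted(items) for reftype, items in grouped.items()}


refs = {
    RefSketch("https://example.com/a.pdf", "pdf", 1),
    RefSketch("https://example.com", "url", 2),
    RefSketch("10.1000/182", "doi", 3),
}
grouped = group_by_reftype(refs)
print(grouped)
```

With a real linkrot instance, `pdf.get_references_as_dict()` returns the grouped dictionary directly, so this helper is only useful for understanding the shape of the result.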
Arguments:
target_dir: The path of the directory to which the reference PDFs should be downloaded 
Usage:
pdf.download_pdfs("target-directory") #pdf is the instance of the linkrot class
Return type: None
Information Provided: Downloads all the reference PDFs to the specified directory.
Import:
from linkrot.downloader import sanitize_url, get_status_code, check_refs
Arguments:
url: The url to be sanitized.
Usage:
new_url = sanitize_url(old_url) 
Return type: String <class 'str'>
Information Provided: Prefixes the URL with 'http://' if it does not already have one, and makes sure it is in UTF-8 format.
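The behaviour described above can be sketched roughly as follows. This is an illustrative stand-in, not linkrot's implementation: `sanitize_url_sketch` is a hypothetical name, and the exact rules the library applies (for example, how it detects an existing scheme) are assumptions here.

```python
from typing import Union


def sanitize_url_sketch(url: Union[str, bytes]) -> str:
    """Illustrative stand-in; not linkrot's actual sanitize_url."""
    # Assumption: byte input is decoded as UTF-8, dropping undecodable bytes.
    if isinstance(url, bytes):
        url = url.decode("utf-8", errors="ignore")
    url = url.strip()
    # Assumption: a URL without any scheme separator gets an http:// prefix.
    if "://" not in url:
        url = "http://" + url
    return url


print(sanitize_url_sketch("example.com/paper.pdf"))
print(sanitize_url_sketch("https://example.com"))
```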
Arguments:
url: The url to be checked for its status. 
Usage:
status_code = get_status_code(url) 
Return type: String <class 'str'>
Information Provided: Checks if the URL is active or broken.
Arguments:
refs: set of linkrot.backends.Reference objects
verbose: whether it should print every reference with its code or just the summary of the link checker
max_threads: number of threads for multithreading
Usage:
check_refs(pdf.get_references()) #pdf is the instance of the linkrot class
Return type: None
Information Provided: Prints references with their status code and a summary of all the broken/active links on terminal.
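For readers curious how a multithreaded link check with a `max_threads` setting might be structured, here is a minimal sketch built on the standard library's thread pool. `check_all`, `summarize`, and `fake_checker` are hypothetical names, and the checker is injected so the example runs without network access. In real use, a function such as `get_status_code` would be passed instead. This is not how linkrot is necessarily implemented.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def check_all(urls, checker, max_threads=5):
    """Run checker(url) concurrently and return {url: status code string}."""
    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # pool.map preserves input order, so zip pairs each URL with its code.
        codes = list(pool.map(checker, urls))
    return dict(zip(urls, codes))


def summarize(results):
    """Count how many URLs returned each status code."""
    return Counter(results.values())


def fake_checker(url):
    # Hypothetical offline stand-in for a real status check.
    return "404" if "missing" in url else "200"


results = check_all(
    ["http://example.com/ok", "http://example.com/missing"],
    fake_checker,
    max_threads=2,
)
print(results)
print(summarize(results))
```

Injecting the checker keeps the concurrency logic separate from the network call, which also makes the summary step easy to test.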
Import:
from linkrot.extractor import extract_urls, extract_doi, extract_arxiv
Get pdf text:
text = pdf.get_text() #pdf is the instance of the linkrot class
Arguments:
text: String of text to extract urls from
Usage:
urls = extract_urls(text)
Return type: Set <class 'set'> of URLs
Information Provided: All URLs in the text
Arguments:
text: String of text to extract arXivs from
Usage:
arxiv = extract_arxiv(text)
Return type: Set <class 'set'> of arxivs
Information Provided: All arXivs in the text
Arguments:
text: String of text to extract DOIs from
Usage:
doi = extract_doi(text)
Return type: Set <class 'set'> of DOIs
Information Provided: All DOIs in the text
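As a rough illustration of text-based DOI extraction, the sketch below uses a simple regular expression. The pattern and the `extract_doi_sketch` name are assumptions made for this example. linkrot's own `extract_doi` may use a different pattern and handle more edge cases.

```python
import re

# Assumed pattern: "10.", a 4-9 digit registrant code, "/", then a suffix
# that runs until whitespace, a quote, or an angle bracket.
DOI_PATTERN = re.compile(r"\b10\.\d{4,9}/[^\s\"<>]+")


def extract_doi_sketch(text):
    """Return the set of DOI-like strings found in text."""
    found = set()
    for match in DOI_PATTERN.finditer(text):
        # Strip sentence punctuation that often trails a DOI in running text.
        found.add(match.group(0).rstrip(".,;:)]"))
    return found


sample = "See doi:10.1000/182 and https://doi.org/10.1038/nature12373."
dois_found = extract_doi_sketch(sample)
print(sorted(dois_found))
```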
Import:
from linkrot.retraction import check_dois_for_retractions, RetractionChecker
Arguments:
dois: Set of DOI strings to check for retractions
verbose: Whether to print detailed results
Usage:
# Get DOIs from PDF text
text = pdf.get_text()
dois = extract_doi(text)
# Check for retractions
result = check_dois_for_retractions(dois, verbose=True)
Return type: Dictionary with retraction results and summary
Information Provided: Checks each DOI against retraction databases and provides detailed information about any retracted papers found.
For more advanced usage, you can use the RetractionChecker class directly:
checker = RetractionChecker()
# Check individual DOI
result = checker.check_doi("10.1000/182")
# Check multiple DOIs
results = checker.check_multiple_dois({"10.1000/182", "10.1038/nature12373"})
# Get summary
summary = checker.get_retraction_summary(results)
The retraction checker uses multiple methods to detect retractions:
To view our code of conduct please visit our Code of Conduct page.
This program is licensed under the GPLv3 License.