Figure extraction using deep neural nets.
deepfigures-open
is the companion code to the paper
Extracting Scientific Figures with Distantly Supervised Neural Networks.
It provides code to run our model and extract figures from PDFs,
as well as code for generating our training data.
The generated dataset used in our paper is available for download here.
Note: This is research code and is not intended for use in production.
Deepfigures depends on pdffigures2 for caption extraction. You must
compile the utility and place it into the bin/
directory:
git clone https://github.com/allenai/pdffigures2
cd pdffigures2
sbt assembly
mv target/scala-2.11/pdffigures2-assembly-0.0.12-SNAPSHOT.jar ../bin
cd ..
rm -rf pdffigures2
If the jar for pdffigures has a different name then
'pdffigures2-assembly-0.0.12-SNAPSHOT.jar'
, then adjust the
PDFFIGURES_JAR_NAME
parameter in deepfigures/settings.py
accordingly.
You have to download weights for the deepfigures model into this
repository in order to run it. You can download a tarball of the weights
here. Once you've downloaded the tarball, extract
it and place the weights/
directory in the root of this repository.
If you choose to name the weights directory something different, be sure
to update the TENSORBOX_MODEL
constant in deepfigures/settings.py
.
In deepfigures/settings.py
set the ARXIV_DATA_TMP_DIR
and
ARXIV_DATA_OUTPUT_DIR
variables to local directories on your
machine. Make sure that these directories have at least a few TBs of
storage since there are a lot of arXiv papers.
In deepfigures/settings.py
set the PUBMED_INPUT_DIR
,
PUBMED_INTERMEDIATE_DIR
, PUBMED_DISTANT_DATA_DIR
, and
LOCAL_PUBMED_DISTANT_DATA_DIR
to different directories.
PUBMED_INPUT_DIR
, PUBMED_INTERMEDIATE_DIR
, and
PUBMED_DISTANT_DATA_DIR
can be directories in S3, but
LOCAL_PUBMED_DISTANT_DATA_DIR
should be a local directory.
Additionally, PUBMED_INPUT_DIR
should have all of the
Pubmed Open Access subset papers split into
directories with the following structure:
xx/yy/example-pmc-data.tar.gz
Where xx
and yy
range from 00
to ff
.
Make sure you have docker installed and that you also have all the requirements installed:
pip install -r requirements.txt
Much of the functionality for this code requires usage of AWS (such as
downloading the data for arxiv). Make sure the deepfigures-local.env
file is filled out with your AWS credentials if you want to run with
this functionality. Please note that running this code with the AWS
functionality will incur charges on your AWS account.
The AWS integration is used for:
- downloading the arXiv data dump from S3 to generate the arXiv paper labels.
- storing intermediate computations in S3 while running the pubmed data pipeline.
For most use cases, users will prefer to download the dataset directly rather than rebuilding it themselves.
Use the manage.py
script in the root of this repository to view common
commands for development. To get a list of commands, run:
python manage.py --help
You'll see something like:
$ python manage.py --help
Usage: manage.py [OPTIONS] COMMAND [ARGS]...
A high-level interface to admin scripts for deepfigures.
Options:
-v, --verbose Turn on verbose logging for debugging purposes.
-l, --log-file TEXT Log to the provided file path instead of stdout.
-h, --help Show this message and exit.
Commands:
build Build docker images for deepfigures.
detectfigures Run figure extraction on the PDF at PDF_PATH.
generatearxiv Generate arxiv data for deepfigures.
generatepubmed Generate pubmed data for deepfigures.
testunits Run unit tests for deepfigures.
To learn more about a command, call it with the --help
option.
To extract figures from a PDF, use the detectfigures
command.
For questions, contact the authors of the paper Extracting Scientific Figures with Distantly Supervised Neural Networks.