nesa
This project consists of a series of scripts that extract data from the Network Rail National Electronic Sectional Appendix (NESA) into a series of Route Clearance reports using PDF text extraction.
The downloadable NESA data is available here and contains a set of route PDF files with spreadsheet and embedded TIFF image files.
Extracted data download links
Unformatted text | Per page Route Clearance TSV | Route Clearance XLSX Report |
---|---|---|
Anglia Route | Anglia Route | Anglia Route |
Kent, Sussex and Wessex | Kent, Sussex and Wessex | Kent, Sussex and Wessex |
London North-Eastern | London North-Eastern | London North-Eastern |
London North-Western North | London North-Western North | London North-Western North |
London North-Western South | London North-Western South | London North-Western South |
Scotland | Scotland | Scotland |
Western | Western | Western |
- Notes: South Wales data is now in Western, and North Wales in London North Western North. Kent, Sussex and Wessex data is now back in the Kent-Sussex-Wessex directory.
Data Source
The PDF files for these seven routes are available here
Prerequisites
- jq is a lightweight and flexible command-line JSON processor. On a Debian or similar apt based Linux system:
  $ sudo apt install jq
- poppler-utils package to decompress, extract text and render PDF files, based on the xpdf-3.0 code base
  $ sudo apt install poppler-utils
- ghostscript package to interpret and manipulate PostScript and PDF files
  $ sudo apt install ghostscript
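Before running the scripts it can help to confirm that the command-line prerequisites are on the PATH. The following is a minimal sketch, not part of the project: the `missing_tools` helper is hypothetical, and it assumes the three packages are used through the `jq`, `pdftotext` (from poppler-utils) and `gs` (from ghostscript) executables.

```python
import shutil

# Assumption: the executable that stands in for each apt package.
REQUIRED_TOOLS = {
    "jq": "jq",
    "pdftotext": "poppler-utils",
    "gs": "ghostscript",
}


def missing_tools(which=shutil.which):
    """Return the apt packages whose executables are not found on PATH."""
    return sorted(pkg for exe, pkg in REQUIRED_TOOLS.items() if which(exe) is None)


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing packages: " + ", ".join(missing))
    else:
        print("All command-line prerequisites found")
```

The `which` parameter exists only so the lookup can be replaced in a test; in normal use the default `shutil.which` is enough.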
python dependencies
- Python 3.9 to run the scripts. Tested against Python 3.7, 3.8 and 3.9
- Python pandas data processing library
- Python pdfplumber table and visual debugging PDF data extract library
- Python pdfminer.six PDF information extraction library
- Python openpyxl library to write Excel 2010 xlsx files
python virtualenv package
For ease of use, manage the python package dependencies with a local virtual environment venv using the python virtualenv package:
$ sudo apt install virtualenv
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
Creating the Route Clearance reports
The reports for the routes are created as follows:
Download the data
Download the seven route Sectional Appendix PDF files into the download directory from here
Process the PDF files
To extract the data, execute the run.sh script:
$ ./run.sh
This executes a series of scripts to segment, extract and output the data, creating a series of TSV and Excel spreadsheets in the seven route directories.
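As an illustration of how per-page TSV output could be combined into a single table, here is a minimal sketch using only the standard library. It is not the project's collate script: the `collate_tsv` helper and the column names in the usage example are hypothetical, and it assumes every page carries the same header row.

```python
import csv
import io


def collate_tsv(pages):
    """Merge per-page TSV texts that share a header row.

    `pages` is an iterable of TSV strings. The header is kept once and
    the data rows are concatenated in page order.
    """
    header = None
    rows = []
    for text in pages:
        reader = csv.reader(io.StringIO(text), delimiter="\t")
        page_rows = [row for row in reader if row]  # skip blank lines
        if not page_rows:
            continue
        if header is None:
            header = page_rows[0]
        elif page_rows[0] != header:
            raise ValueError("page header does not match the first page")
        rows.extend(page_rows[1:])
    return header, rows


if __name__ == "__main__":
    # Hypothetical column names and values, for illustration only.
    page1 = "ELR\tGauge\nAAA\tW6\n"
    page2 = "ELR\tGauge\nBBB\tW8\n"
    header, rows = collate_tsv([page1, page2])
    print(header, rows)
```

Raising on a mismatched header is a deliberate choice here: a page whose table was extracted with shifted columns is better caught early than merged silently.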
How it works
The data is extracted from the PDF text-object elements. However, formatting issues and the use of a grey-scale background in a number of the key route-clearance tables break pdfplumber and pdfminer formatted text extraction.
To overcome this, the PDF files are converted to an uncompressed CMYK PDF/A format, and the grey background is removed by deleting the call and graphic state for the embedded grey background image. Apart from the fact that it seems to work, this is in no way a recommended approach.
It creates broken PDF files, as the internal PDF checksums no longer match. It assumes the background grey colour is encoded as 0.8081 g or 1 1 0 rg and rendered using the call to f*. Were the PDF rendering software used by Network Rail, Ghostscript, or qpdf to change, this would just break. YMMV
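The background removal described above can be sketched as a byte-level substitution on an uncompressed content stream. This is an illustration under stated assumptions, not the project's filter script: the `strip_grey_fill` helper is hypothetical, and it assumes each operator sits on its own line, with the fill colour (`0.8081 g` or `1 1 0 rg`) followed directly by path-construction lines and then `f*`.

```python
import re

# Assumed layout: fill colour, then path construction operators with numeric
# operands (re, m, l, c) or a bare close-path (h), then the even-odd fill f*.
GREY_FILL = re.compile(
    rb"^(?:0\.8081 g|1 1 0 rg)\n"           # set the fill colour
    rb"(?:[-0-9. ]+ (?:re|m|l|c)\n|h\n)*"   # build the path
    rb"f\*\n",                              # fill the path
    re.MULTILINE,
)


def strip_grey_fill(stream: bytes) -> bytes:
    """Delete background fills matching the assumed colour and operator layout."""
    return GREY_FILL.sub(b"", stream)


if __name__ == "__main__":
    # Hand-written sample stream, not taken from a real NESA file.
    sample = b"q\n0.8081 g\n0 0 100 20 re\nf*\nQ\nBT\n(W6) Tj\nET\n"
    print(strip_grey_fill(sample))
```

Restricting the middle of the pattern to path operators keeps the match from running on to an unrelated `f*` further down the stream. Deleting bytes this way changes the stream size without updating the rest of the file, which fits the caveat above that the resulting PDF files are broken.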
License
Network Rail is the copyright holder and retains all intellectual property rights related to the data and derived data contained within the National Electronic Sectional Appendix as set out here.
The scripts and other material are provided under the terms set out in the LICENSE.
Acknowledgement
The authors would like to thank Network Rail for providing this data, and all the contributors to the tools and libraries used.