mirror of https://github.com/anisotropi4/nesa

anisotropi4 b403408573 initial commit		8 months ago
Anglia	NESA June 2023 data	1 year ago
Kent-Sussex-Wessex	NESA June 2023 data	1 year ago
London-North-Eastern	NESA June 2023 data	1 year ago
London-North-Western-North	NESA June 2023 data	1 year ago
London-North-Western-South	NESA June 2023 data	1 year ago
Scotland	NESA June 2023 data	1 year ago
Western	NESA June 2023 data	1 year ago
archive	June 2021 update	4 years ago
download	update March 2022	3 years ago
.gitignore	Initial commit	5 years ago
LICENSE	Initial commit	5 years ago
README.md	Update README.md	2 years ago
collate-gauge.py	update pandas concat not append	3 years ago
collate-gauge2.py	tweaks to run.sh and update rectangle find	1 year ago
filter_pdf.py	update removes light greyscale to white and dark greyscale to black	2 years ago
format-md.py	initial commit	3 years ago
generate-png.sh	initial commit	1 year ago
generate-txt.sh	initial commit	8 months ago
pdf-separate.sh	initial commit	1 year ago
remove-background.sh	comment out debug	2 years ago
requirements.txt	update requirements	2 years ago
run.sh	tweaks to run.sh and update rectangle find	1 year ago

nesa

This project consists of a series of scripts that extract data from the Network Rail National Electronic Sectional Appendix (NESA) into a series of Route Clearance reports using PDF text extraction.

The downloadable NESA data is available here as a set of route PDF files with spreadsheet and embedded TIFF image files.

Extracted data download links

Unformatted text	Per page Route Clearance TSV	Route Clearance XLSX Report
Anglia Route	Anglia Route	Anglia Route
Kent, Sussex and Wessex	Kent, Sussex and Wessex	Kent, Sussex and Wessex
London North-Eastern	London North-Eastern	London North-Eastern
London North-Western North	London North-Western North	London North-Western North
London North-Western South	London North-Western South	London North-Western South
Scotland	Scotland	Scotland
Western	Western	Western

Notes: South Wales data is now in Western, and North Wales in London North Western North. Kent, Sussex and Wessex data is now back in the Kent-Sussex-Wessex directory.

Data Source

The PDF files for these seven routes are available here

Prerequisites

jq is a lightweight and flexible command-line JSON processor. On a Debian or similar apt-based Linux system:

$ sudo apt install jq

The poppler-utils package, based on the xpdf-3.0 code base, to decompress, extract text from and render PDF files:

$ sudo apt install poppler-utils

The ghostscript package to interpret and manipulate PostScript and PDF files:

$ sudo apt install ghostscript

`python` dependencies

Python 3.9 to run the scripts. Tested against Python 3.7, 3.8 and 3.9
Python pandas data processing library
Python pdfplumber table and visual debugging PDF data extract library
Python pdfminer.six PDF information extraction library
Python openpyxl library to write Excel 2010 xlsx files

`python virtualenv` package

For ease of use, manage the Python package dependencies with a local virtual environment venv using the python virtualenv package:

$ sudo apt install virtualenv
$ virtualenv venv
$ source venv/bin/activate
$ pip install -r requirements.txt
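
Before running the scripts it can help to confirm that the prerequisites above are in place. The following is a minimal sketch, not part of this repository: it assumes the command names `jq`, `pdftotext` (from poppler-utils) and `gs` (from ghostscript), and the import names `pandas`, `pdfplumber`, `pdfminer` and `openpyxl`. The function name `missing_prerequisites` is hypothetical.

```python
import importlib.util
import shutil

# Assumed command names provided by the jq, poppler-utils and ghostscript packages
COMMANDS = ["jq", "pdftotext", "gs"]
# Import names of the Python libraries listed above (pdfminer.six imports as pdfminer)
MODULES = ["pandas", "pdfplumber", "pdfminer", "openpyxl"]


def missing_prerequisites(commands=COMMANDS, modules=MODULES):
    """Return (missing_commands, missing_modules) for the current environment."""
    # shutil.which returns None when a command is not found on the PATH
    missing_commands = [name for name in commands if shutil.which(name) is None]
    # find_spec returns None when a top-level module cannot be imported
    missing_modules = [
        name for name in modules if importlib.util.find_spec(name) is None
    ]
    return missing_commands, missing_modules


if __name__ == "__main__":
    commands, modules = missing_prerequisites()
    print("missing commands:", ", ".join(commands) or "none")
    print("missing python modules:", ", ".join(modules) or "none")
```

Run it inside the activated venv so that the module check reflects the packages installed from requirements.txt.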

Creating the Route Clearance reports

The reports for the routes are created as follows:

Download the data

Download the seven route Sectional Appendix PDF files into the download directory from here

Process the PDF files

To extract the data execute the run.sh script:

$ ./run.sh

This executes a series of scripts to segment, extract and output the data, creating a series of TSV files and Excel spreadsheets in the seven route directories.
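
To illustrate the kind of collation step involved, here is a minimal sketch that gathers per-page TSV files from one directory into a single list of rows. It is not the repository's collate-gauge.py, which uses pandas; this version uses only the standard library `csv` module as a stand-in. The function name `collate_tsv`, the `source` column and the `*.tsv` file pattern are assumptions made for the example.

```python
import csv
from pathlib import Path


def collate_tsv(directory, pattern="*.tsv"):
    """Read every TSV file matching pattern and return all rows as dicts.

    Each row gains a hypothetical 'source' key holding the file it came from.
    """
    rows = []
    # Sort the paths so pages are collated in a stable, name-based order
    for path in sorted(Path(directory).glob(pattern)):
        with path.open(newline="", encoding="utf-8") as handle:
            # The first line of each file is taken as the column header
            for row in csv.DictReader(handle, delimiter="\t"):
                row["source"] = path.name
                rows.append(row)
    return rows
```

For example, `collate_tsv("Anglia")` would return one dict per data row across the TSV files found in the Anglia directory, assuming each file starts with a header line.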

How it works

The data is extracted from the PDF text-object elements. However, formatting issues and the use of a grey-scale background in a number of the key route-clearance tables break pdfplumber and pdfminer formatted text extraction.

To overcome this, the PDF files are converted to an uncompressed CMYK PDF/A format, and the grey background is removed by deleting the call and graphics state for the embedded grey background image. Although it seems to work, this is in no way a recommended approach.

It creates broken PDF files, as the internal PDF checksums no longer match. It assumes the background grey colour is encoded as 0.8081 g or 1 1 0 rg and rendered using the call to f*. Were the PDF rendering software used by Network Rail, Ghostscript, or qpdf to change, this would just break. YMMV
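
The idea can be sketched on a single uncompressed content stream. This is an illustration only and not the repository's filter_pdf.py. It assumes one operator per line, and it takes a slightly different route from the deletion described above: rather than removing lines, it swaps the `f*` fill for the PDF `n` operator, which ends a path without painting it. The function name `strip_grey_fill` is hypothetical.

```python
import re

# Fill-colour settings assumed (from the text above) to mark the grey background
GREY_OPS = (b"0.8081 g", b"1 1 0 rg")
# Any other non-stroking colour operator: g (grey), rg (RGB) or k (CMYK)
COLOUR_OP = re.compile(rb"^[\d.\s]+ (g|rg|k)$")


def strip_grey_fill(stream: bytes) -> bytes:
    """Neutralise f* fills that follow one of the assumed grey colour settings."""
    out = []
    grey = False  # True while the current fill colour is the background grey
    for line in stream.split(b"\n"):
        token = line.strip()
        if token in GREY_OPS:
            grey = True
        elif COLOUR_OP.match(token):
            grey = False
        if grey and token == b"f*":
            # Replace the even-odd fill with 'n' so the path is discarded unpainted
            out.append(b"n")
            continue
        out.append(line)
    return b"\n".join(out)
```

Given a stream containing `0.8081 g`, a rectangle and `f*`, followed by `0 g`, another rectangle and `f*`, only the first fill is replaced. As with the approach described above, any change in how the background is encoded would leave the stream untouched.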

License

Network Rail are the copyright holder and retain all intellectual property rights related to the data and derived data contained within the National Electronic Sectional Appendix as set out here

The scripts and other material are provided under the terms set out in the LICENSE

Acknowledgement

The authors would like to thank Network Rail for providing this data and all the contributors to the tools and libraries used