National Treasure Hackathon - Backend Requirements
Volunteer Dispatch Server
Given a naId (every document has a unique naId), query its details at https://catalog.archives.gov/api/v1/?naIds={naId}
- Check whether `results.result[0].objects` exists. If it does, the document has already been digitized and its image files are in the `results.result[0].objects.object` array.
- If it has not been digitized, check whether `results.result[0].description.series.fileUnit` exists and holds a number greater than 1. If so, there are multiple files under this naId, and you need to query `https://catalog.archives.gov/api/v1/?description.fileUnit.parentSeries.naId={naId}` to get the sub-files and their naIds.
- Next, check `accessRestriction.status.termName` under `description` or `description.fileUnit`. If it is `restricted`, we do not want to dispatch this naId to a volunteer.
- If everything above checks out, check our own flag, which is one of `unstarted`, `started but incomplete`, or `complete`.
- Dispatch the unstarted and incomplete ones to the volunteer client.
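The checks above can be sketched as a pure function over an already-fetched, already-parsed API response. This is only a rough sketch under several assumptions: the field paths are taken literally from the notes, `fileUnit` is assumed to be readable as a count, the comparison against `restricted` is assumed to be case-insensitive, and the names `dig`, `as_count`, `access_status`, `classify` and the returned labels are all hypothetical.

```python
from typing import Any, Optional

CATALOG_API = "https://catalog.archives.gov/api/v1/"


def dig(node: Any, *keys: Any) -> Any:
    """Follow dict keys / list indexes; return None if any step is missing."""
    for key in keys:
        if isinstance(key, int):
            if not isinstance(node, list) or key >= len(node):
                return None
        elif not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def sub_files_url(na_id: str) -> str:
    """URL from the notes for listing the sub-files of a multi-file naId."""
    return f"{CATALOG_API}?description.fileUnit.parentSeries.naId={na_id}"


def as_count(value: Any) -> int:
    """Assumption: fileUnit can be read as a number; anything else counts as 0."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def access_status(description: dict) -> Optional[str]:
    """Look for the term under `description`, then under `description.fileUnit`."""
    for prefix in ((), ("fileUnit",)):
        term = dig(description, *prefix, "accessRestriction", "status", "termName")
        if term is not None:
            return term
    return None


def classify(response: dict, our_flag: str) -> str:
    """Return a hypothetical label describing what to do with this naId."""
    result = dig(response, "results", "result", 0)
    if result is None:
        return "not_found"
    if dig(result, "objects") is not None:
        return "already_digitized"
    description = dig(result, "description") or {}
    if as_count(dig(description, "series", "fileUnit")) > 1:
        return "query_sub_files"  # caller should fetch sub_files_url(naId)
    status = access_status(description)
    if isinstance(status, str) and status.lower() == "restricted":
        return "skip_restricted"
    if our_flag in ("unstarted", "started but incomplete"):
        return "dispatch"
    return "skip"
```

Keeping the network call out of `classify` means the rules can be exercised with hand-written dictionaries; a caller would fetch the JSON separately and, on `query_sub_files`, repeat the check for each sub-file naId.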
NoSQL Table schema for TNT-Dispatch
- uid: unique id for the dispatch
- catalogId: unique id of the record in the TNT-Catalog
- userId: unique id of the requesting user/volunteer
- naId: associated naId of the record
- createdAt
- updatedAt
- completedAt
- status: enum['dispatched', 'complete', 'incomplete', 'error']
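As an illustration, one TNT-Dispatch record could be modelled like this. The field names and status values come from the schema above; the field types, the use of a UUID for `uid`, UTC timestamps, the initial `dispatched` status, and the `update_status` helper are assumptions made for the sketch, not decisions recorded in these notes.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class DispatchStatus(str, Enum):
    DISPATCHED = "dispatched"
    COMPLETE = "complete"
    INCOMPLETE = "incomplete"
    ERROR = "error"


def _now() -> datetime:
    # Assumption: timestamps are stored in UTC.
    return datetime.now(timezone.utc)


@dataclass
class Dispatch:
    catalogId: str  # id of the record in TNT-Catalog
    userId: str     # id of the requesting user/volunteer
    naId: str       # associated naId of the record
    # Assumption: a random UUID is an acceptable unique dispatch id.
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))
    createdAt: datetime = field(default_factory=_now)
    updatedAt: datetime = field(default_factory=_now)
    completedAt: Optional[datetime] = None
    status: DispatchStatus = DispatchStatus.DISPATCHED

    def update_status(self, status: DispatchStatus) -> None:
        """Set the status, refresh updatedAt, and stamp completedAt on completion."""
        self.status = status
        self.updatedAt = _now()
        if status is DispatchStatus.COMPLETE:
            self.completedAt = self.updatedAt
```

For example, `Dispatch(catalogId="c1", userId="u1", naId="123")` starts out as `dispatched` with no `completedAt`, and `update_status(DispatchStatus.COMPLETE)` fills it in.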
Endpoints of the Dispatch Server
- Request a dispatch: [Endpoint]/request
- Update a dispatch status: [Endpoint]/update
Document Index Server
NARA API GitHub (query example)
National Archives API recorder (first 100 rows only, out of 12600)
https://github.com/hsin421/tw-national-treasure
Suggested by Simon Liu
Digital archive management system (open source): Fedora Commons
OCR SERVER
OCR with Python and Google Cloud Vision API reference:
https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d
repo:
https://github.com/tl578/g0v-nyc
Apply for an API key:
https://developers.google.com/api-client-library/python/guide/aaa_apikeys
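Preliminary test:
- Text in shadowed areas is picked up, but because the photographs were not taken level, some words come out in a different order than in the document.
- We could set this problem aside for now, since word order does not affect keyword search results.
Does the website need engineers?
Updated 12.10.2016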
OCR server is up and running at
https://nationa-treasure-vision.herokuapp.com/vision
Source Code
https://github.com/national-treasures-tw/vision
Test by POSTing { types: 'text', imageUrl: 'YOUR_IMAGE_URL' }
Please do not abuse this, as our free quota is only 1000 requests.
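A minimal sketch of building that test request in Python, using only the standard library. The notes give the body fields but not the encoding, so sending it as JSON with a `Content-Type: application/json` header is an assumption, and `build_ocr_request` is a hypothetical helper name. The function only builds the request and does not send it, so nothing is spent from the quota until you choose to.

```python
import json
import urllib.request

VISION_URL = "https://nationa-treasure-vision.herokuapp.com/vision"


def build_ocr_request(image_url: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the OCR server."""
    # Assumption: the server expects a JSON-encoded body.
    body = json.dumps({"types": "text", "imageUrl": image_url}).encode("utf-8")
    return urllib.request.Request(
        VISION_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To actually send it, pass the result to `urllib.request.urlopen(...)` and read the response; each call counts against the free quota.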