National Treasure Hackathon - Backend Requirements

Last edited: 2017-08-09 Created: 2016-10-30

HSIN H: Volunteer dispatch server

Given a naId (every document has a unique naId), query its details at https://catalog.archives.gov/api/v1/?naIds={naId}

Check whether `results.result[0].objects` exists. If it does, the record has already been digitized and the image files are in the `results.result[0].objects.object` array.
If it has not been digitized, check whether `results.result[0].description.series.fileUnit` exists and has a value greater than 1. If so, there are multiple files under this naId, and you need to query `https://catalog.archives.gov/api/v1/?description.fileUnit.parentSeries.naId={naId}` to get the sub-files and their naIds.
Next, check `accessRestriction.status.termName` under `description` or `description.fileUnit`. If it is `restricted`, we do not dispatch this naId to volunteers.
If everything above checks out, check our own flag, which is one of `unstarted`, `started but incomplete`, or `complete`.
Dispatch the unstarted and incomplete ones to the volunteer client.
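
The checks above can be sketched in Python as follows. This is only a rough illustration, not the dispatch server's actual code. It assumes the API response has already been parsed into a dict with `results` at the top level, that `fileUnit` holds a number or numeric string, and that `termName` is compared case-insensitively. The function names (`detail_url`, `children_url`, `check_record`) are hypothetical. Whether an already digitized record should still be dispatched is not stated above, so the sketch only reports it and leaves that decision to the caller.

```python
from urllib.parse import urlencode

API_BASE = "https://catalog.archives.gov/api/v1/"
DISPATCHABLE_FLAGS = {"unstarted", "started but incomplete"}


def detail_url(na_id):
    """URL for the detail query of a single naId."""
    return API_BASE + "?" + urlencode({"naIds": na_id})


def children_url(na_id):
    """URL for listing the sub-files under a naId."""
    return API_BASE + "?" + urlencode({"description.fileUnit.parentSeries.naId": na_id})


def dig(obj, *keys):
    """Follow keys/indices through nested dicts and lists; None if a step is missing."""
    for key in keys:
        try:
            obj = obj[key]
        except (KeyError, IndexError, TypeError):
            return None
    return obj


def as_int(value):
    """Best-effort integer conversion; 0 when the value is missing or not numeric."""
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def check_record(response, flag):
    """Summarize the checks for one parsed response and our own progress flag."""
    result = dig(response, "results", "result", 0)
    if result is None:
        return None  # no such record in the response

    digitized = dig(result, "objects") is not None
    images = dig(result, "objects", "object") or []
    if isinstance(images, dict):  # assumption: a single object may not be wrapped in a list
        images = [images]

    file_units = as_int(dig(result, "description", "series", "fileUnit"))

    terms = [
        dig(result, "description", "accessRestriction", "status", "termName"),
        dig(result, "description", "fileUnit", "accessRestriction", "status", "termName"),
    ]
    # Exact match, so that "Unrestricted" is not mistaken for "restricted".
    restricted = any(isinstance(t, str) and t.lower() == "restricted" for t in terms)

    return {
        "digitized": digitized,
        "images": images,
        "needs_children_query": not digitized and file_units > 1,
        "restricted": restricted,
        "dispatch": not restricted and flag in DISPATCHABLE_FLAGS,
    }
```

When `needs_children_query` is true, the caller would fetch `children_url(na_id)` and run the same checks on each sub-file.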

NoSQL Table schema for TNT-Dispatch

uid: unique id for the dispatch
catalogId: unique id of the record in the TNT-Catalog
userId: unique id of the requesting user/volunteer
naId: associated naId of the record
createdAt
updatedAt
completedAt
status: enum['dispatched', 'complete', 'incomplete', 'error']
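
For illustration, the schema above could be modelled as a Python dataclass along these lines. The field names and status values come from the schema. The field types, the use of a random hex string for `uid`, the UTC timestamps, and the `to_item` helper are all assumptions made for this sketch.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class DispatchStatus(str, Enum):
    DISPATCHED = "dispatched"
    COMPLETE = "complete"
    INCOMPLETE = "incomplete"
    ERROR = "error"


def _now():
    """Current time in UTC (assumed convention for all timestamps)."""
    return datetime.now(timezone.utc)


@dataclass
class Dispatch:
    catalogId: str  # unique id of the record in the TNT-Catalog
    userId: str     # unique id of the requesting user/volunteer
    naId: str       # associated naId of the record
    uid: str = field(default_factory=lambda: uuid.uuid4().hex)
    createdAt: datetime = field(default_factory=_now)
    updatedAt: datetime = field(default_factory=_now)
    completedAt: Optional[datetime] = None
    status: DispatchStatus = DispatchStatus.DISPATCHED

    def to_item(self):
        """Plain dict with ISO 8601 timestamps, suitable for a document store."""
        return {
            "uid": self.uid,
            "catalogId": self.catalogId,
            "userId": self.userId,
            "naId": self.naId,
            "createdAt": self.createdAt.isoformat(),
            "updatedAt": self.updatedAt.isoformat(),
            "completedAt": self.completedAt.isoformat() if self.completedAt else None,
            "status": self.status.value,
        }
```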

Endpoints of the Dispatch Server

Request a dispatch: [Endpoint]/request
Update a dispatch status: [Endpoint]/update
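
The request and response formats of these two endpoints are not specified here. As a rough sketch only, the logic behind them might look like the following, with an in-memory dict standing in for the NoSQL table. The function names, the arguments, and the rule that `completedAt` is set when the status becomes `complete` are assumptions.

```python
import uuid
from datetime import datetime, timezone

VALID_STATUSES = {"dispatched", "complete", "incomplete", "error"}


def _now_iso():
    """Current UTC time as an ISO 8601 string."""
    return datetime.now(timezone.utc).isoformat()


def request_dispatch(store, user_id, catalog_id, na_id):
    """Logic behind [Endpoint]/request: create and store a new dispatch."""
    now = _now_iso()
    item = {
        "uid": uuid.uuid4().hex,
        "catalogId": catalog_id,
        "userId": user_id,
        "naId": na_id,
        "createdAt": now,
        "updatedAt": now,
        "completedAt": None,
        "status": "dispatched",
    }
    store[item["uid"]] = item
    return item


def update_dispatch(store, uid, status):
    """Logic behind [Endpoint]/update: change the status of an existing dispatch."""
    if status not in VALID_STATUSES:
        raise ValueError("unknown status: %r" % (status,))
    item = store[uid]  # raises KeyError for an unknown uid
    now = _now_iso()
    item["status"] = status
    item["updatedAt"] = now
    if status == "complete":
        item["completedAt"] = now
    return item
```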

Document Index Server

NARA API GitHub (query example)

National Archives API recorder (first 100 rows only, out of 12,600)

https://github.com/hsin421/tw-national-treasure

Suggested by Simon Liu

Digital archive management system (open source): Fedora Commons

OCR SERVER

OCR with Python and Google Cloud Vision API reference:

https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d

TI-YEN L repo:

https://github.com/tl578/g0v-nyc

Applying for an API key:

https://developers.google.com/api-client-library/python/guide/aaa_apikeys

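Preliminary tests:

Relatively clean document: OCR result

Most of the content is recognized correctly.

Document with handwritten content: OCR result

Handwriting that is too sloppy or blurry is not recognized.

Document photographed with shadows: OCR result

Text in the shadowed areas is recognized, but because the photo was not taken level, some words come out in a different order than in the document.

Possible solutions:

Write an app that rotates the photographed document by an appropriate angle before OCR.

Alternatively, skip this problem for now, since word order does not affect keyword search results.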
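
To illustrate the second option, here is a minimal Python sketch of a keyword match that ignores word order. The tokenization (lowercased runs of word characters) and the function names are assumptions for the example, not part of the OCR server.

```python
import re


def tokens(text):
    """Set of lowercased words found in the text; order and repeats are discarded."""
    return set(re.findall(r"\w+", text.lower()))


def matches(ocr_text, keywords):
    """True if every keyword occurs somewhere in the OCR text."""
    found = tokens(ocr_text)
    return all(keyword.lower() in found for keyword in keywords)


in_order = "Department of State telegram from Taipei"
scrambled = "telegram Taipei State of from Department"

# Both calls return True, because only the set of words matters.
matches(in_order, ["taipei", "telegram"])
matches(scrambled, ["taipei", "telegram"])
```

This only holds for single-word keywords. A search for an exact multi-word phrase would still depend on the OCR output preserving word order.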

雨蒼 林: Does the website need engineers?

HSIN H: Update, 12.10.2016

The OCR server is up and running at

https://nationa-treasure-vision.herokuapp.com/vision

Source Code

https://github.com/national-treasures-tw/vision

Test by posting { types: 'text', imageUrl: 'YOUR_IMAGE_URL' }

Please do not abuse this; we have a quota of only 1000 free requests.
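
As a hedged example, a test request could be built with Python's standard library as below. The `types` and `imageUrl` fields come from the note above. Sending the body as JSON and reading a JSON reply are assumptions about the server, and `build_ocr_request` and `run_ocr` are hypothetical helper names.

```python
import json
import urllib.request

VISION_URL = "https://nationa-treasure-vision.herokuapp.com/vision"


def build_ocr_request(image_url, url=VISION_URL):
    """Build (but do not send) a POST request asking for text detection on one image."""
    payload = json.dumps({"types": "text", "imageUrl": image_url}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},  # assumption: the server accepts JSON
        method="POST",
    )


def run_ocr(image_url):
    """Send the request and parse the reply, assuming the server answers with JSON."""
    with urllib.request.urlopen(build_ocr_request(image_url), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Each call to `run_ocr` counts against the free quota, so it is worth testing with `build_ocr_request` alone first.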