National Treasure Hackathon (國家寶藏松) - Backend Requirements

Last edited: 2017-08-09 Created: 2016-10-30

 

HSIN H: Volunteer Dispatch Server

 

Given a naId (every document has a unique naId), query its details at https://catalog.archives.gov/api/v1/?naIds={naId}

  1. Check whether `results.result[0].objects` exists. If it does, the document has already been digitized and the image files are in the `results.result[0].objects.object` array.
  2. If it has not been digitized, check whether `results.result[0].description.series.fileUnit` exists and has a number > 1. If so, there are multiple files under this naId, and you need to query `https://catalog.archives.gov/api/v1/?description.fileUnit.parentSeries.naId={naId}` to get the sub-files and their naIds.
  3. Next, check `accessRestriction.status.termName` under `description` or `description.fileUnit`. If it is `restricted`, we don't want to dispatch this naId to a volunteer.
  4. If everything above checks out, check our own flag for the record: `unstarted`, `started but incomplete`, or `complete`.
  5. Dispatch the unstarted and incomplete ones to the volunteer client.
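
The checks above can be sketched as a few small functions over an already-parsed JSON response. This is only a sketch: the field paths are taken from the steps above, while the function names, the handling of `fileUnit` as either a count or a list, and the case-insensitive match on `restricted` are assumptions. The steps do not say whether an already digitized record should still be dispatched, so that decision is left to the caller.

```python
from urllib.parse import urlencode

API_BASE = "https://catalog.archives.gov/api/v1/"
# Flags from step 4 that step 5 says should be dispatched.
DISPATCHABLE_FLAGS = {"unstarted", "started but incomplete"}


def detail_url(na_id):
    """URL for the detail query of a single naId."""
    return API_BASE + "?" + urlencode({"naIds": na_id})


def sub_files_url(na_id):
    """URL for listing the sub-files under a naId (step 2)."""
    return API_BASE + "?" + urlencode({"description.fileUnit.parentSeries.naId": na_id})


def dig(node, *keys):
    """Walk nested dicts; return None as soon as a key is missing."""
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def file_unit_count(value):
    """Assumption: the field is either a list of file units or a count."""
    if isinstance(value, list):
        return len(value)
    try:
        return int(value)
    except (TypeError, ValueError):
        return 0


def inspect_record(response):
    """Apply steps 1-3 to a parsed response; None if there is no result."""
    records = dig(response, "results", "result")
    if not isinstance(records, list) or not records:
        return None
    record = records[0]
    digitized = "objects" in record
    images = dig(record, "objects", "object") or []
    if isinstance(images, dict):  # tolerate a single object instead of an array
        images = [images]
    description = record.get("description", {})
    terms = [
        dig(description, "accessRestriction", "status", "termName"),
        dig(description, "fileUnit", "accessRestriction", "status", "termName"),
    ]
    multiple = file_unit_count(dig(description, "series", "fileUnit")) > 1
    return {
        "digitized": digitized,
        "images": images,
        "needs_sub_query": (not digitized) and multiple,
        "restricted": any(isinstance(t, str) and t.lower() == "restricted" for t in terms),
    }


def should_dispatch(info, flag):
    """Steps 3-5: skip restricted records, dispatch unstarted/incomplete ones."""
    return not info["restricted"] and flag in DISPATCHABLE_FLAGS
```

If the API returns longer restriction labels than the plain `restricted` named above, the comparison in `inspect_record` would need to be loosened accordingly.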

 

NoSQL Table schema for TNT-Dispatch

  • uid: unique id for the dispatch
  • catalogId: unique id of the record in the TNT-Catalog
  • userId: unique id of the requesting user/volunteer
  • naId: associated naId of the record
  • createdAt
  • updatedAt
  • completedAt
  • status: enum['dispatched', 'complete', 'incomplete', 'error']
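
As a rough illustration, the schema above could be modelled like this before it is written to the table. The field names and status values come from the list above; the Python types, the generated `uid`, the UTC timestamps and the `to_item` helper are assumptions made for the sketch.

```python
import uuid
from dataclasses import asdict, dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Optional


class DispatchStatus(str, Enum):
    DISPATCHED = "dispatched"
    COMPLETE = "complete"
    INCOMPLETE = "incomplete"
    ERROR = "error"


def _now():
    return datetime.now(timezone.utc)


@dataclass
class Dispatch:
    catalogId: str  # id of the record in TNT-Catalog
    userId: str     # id of the requesting user/volunteer
    naId: str       # associated naId of the record
    uid: str = field(default_factory=lambda: uuid.uuid4().hex)
    createdAt: datetime = field(default_factory=_now)
    updatedAt: datetime = field(default_factory=_now)
    completedAt: Optional[datetime] = None
    status: DispatchStatus = DispatchStatus.DISPATCHED

    def to_item(self):
        """Plain dict with string timestamps, suitable for a NoSQL item."""
        item = asdict(self)
        item["status"] = self.status.value
        for key in ("createdAt", "updatedAt", "completedAt"):
            if item[key] is not None:
                item[key] = item[key].isoformat()
        return item
```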

 

Endpoints of the Dispatch Server

  • Request a dispatch: [Endpoint]/request
  • Update a dispatch status: [Endpoint]/update
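
The behaviour behind the two endpoints might look roughly like the following, using an in-memory dict in place of the real table. The handler names, their parameters and the rule that `completedAt` is set when the status becomes `complete` are assumptions; the notes above only name the endpoints and the schema.

```python
import uuid
from datetime import datetime, timezone

VALID_STATUSES = {"dispatched", "complete", "incomplete", "error"}


def _now():
    return datetime.now(timezone.utc).isoformat()


def handle_request(table, user_id, catalog_id, na_id):
    """Sketch of [Endpoint]/request: record a new dispatch and return it."""
    now = _now()
    row = {
        "uid": uuid.uuid4().hex,
        "catalogId": catalog_id,
        "userId": user_id,
        "naId": na_id,
        "createdAt": now,
        "updatedAt": now,
        "completedAt": None,
        "status": "dispatched",
    }
    table[row["uid"]] = row
    return row


def handle_update(table, uid, status):
    """Sketch of [Endpoint]/update: change the status of an existing dispatch."""
    if status not in VALID_STATUSES:
        raise ValueError(f"unknown status: {status!r}")
    row = table[uid]  # raises KeyError if the dispatch does not exist
    row["status"] = status
    row["updatedAt"] = _now()
    if status == "complete":
        row["completedAt"] = row["updatedAt"]
    return row
```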

 

 

Document Index Server

 

NARA API GitHub (query example)

National Archives API recorder (first 100 rows only, out of 12600)

https://github.com/hsin421/tw-national-treasure

 

Suggested by Simon Liu

Digital archive management system (open source): Fedora Commons

 

OCR Server

 

OCR with Python and Google Cloud Vision API reference:

https://gist.github.com/dannguyen/a0b69c84ebc00c54c94d

TI-YEN L repo:

https://github.com/tl578/g0v-nyc

 

Apply for an API key:

https://developers.google.com/api-client-library/python/guide/aaa_apikeys

 

Preliminary test:

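  • Most of the content is recognized correctly.
  • Handwriting that is too messy or blurry is not recognized.
  • Text in shadowed areas is recognized, but because the photos were not taken level, some words come out in a different order than in the document.
    • Possible solution:
      • Write an app that rotates the photographed document to a suitable angle before OCR.
  • Alternatively, this problem can be set aside for now, because word order does not affect keyword search results.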
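
To illustrate the last point, here is a minimal sketch of an order-insensitive keyword match. The function names and the sample strings are made up for the example, and it assumes keywords are matched as whole words regardless of case.

```python
import re


def tokenize(text):
    """Lower-cased set of words; the order of words in the text is discarded."""
    return set(re.findall(r"\w+", text.lower()))


def matches(ocr_text, query):
    """True if every keyword in the query appears somewhere in the OCR text."""
    return tokenize(query) <= tokenize(ocr_text)


# The same words in a different order give the same answer.
in_order = "Memorandum of conversation with the ambassador"
scrambled = "ambassador the with Memorandum conversation of"
print(matches(in_order, "ambassador memorandum"))
print(matches(scrambled, "ambassador memorandum"))
```

This only holds for searches on individual keywords. Exact phrase search depends on word order and would still be affected by the scrambled output.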

 

雨蒼 林: Does the website need engineers?

 

HSIN H: update, 12.10.2016

 

OCR server is up and running at

https://nationa-treasure-vision.herokuapp.com/vision

 

Source Code

https://github.com/national-treasures-tw/vision

 

Test by POSTing { types: 'text', imageUrl: 'YOUR_IMAGE_URL' }

Please do not abuse this, as we only have a free quota of 1000 requests.
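
A test request could be built along these lines. The URL and the two payload fields are the ones given above; sending the payload as a JSON body with a `Content-Type: application/json` header is an assumption, as is the helper name. The sketch only builds the request, and the line that would actually send it is left commented out so the quota is not used by accident.

```python
import json
import urllib.request

VISION_URL = "https://nationa-treasure-vision.herokuapp.com/vision"


def build_vision_request(image_url):
    """Build (but do not send) a POST request for the OCR server."""
    payload = {"types": "text", "imageUrl": image_url}
    return urllib.request.Request(
        VISION_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},  # assumption: JSON body
        method="POST",
    )


req = build_vision_request("https://example.com/scan.jpg")  # placeholder image URL
# Sending it consumes one request from the free quota:
# with urllib.request.urlopen(req) as resp:
#     print(resp.read().decode("utf-8"))
```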