第2世界
Published on 2023-06-27

Indexing documents in Elasticsearch

Uploading a document

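use Elastic\Elasticsearch\ClientBuilder; // elasticsearch-php 8.x; on 7.x the namespace is Elasticsearch\ClientBuilder

// Read the uploaded file and base64-encode it for the attachment processor.
$file = $request->file('file');
$content = file_get_contents($file->getRealPath());
$base64content = base64_encode($content);

// Certificate verification is disabled here, which is only reasonable for local testing.
$client = ClientBuilder::create()
    ->setHosts(config("es.hosts"))
    ->setSSLVerification(false)
    ->setApiKey(config("es.api"))
    ->build();

$param = [
    'index' => 'docwrite',
    'body' => [
        "category" => $request->input("category") ?? "",
        "name" => $file->getClientOriginalName(),
        "doc_type" => $file->getClientOriginalExtension(),
        "content" => $base64content,
    ],
    // Run the document through the ingest pipeline named "attachment".
    'pipeline' => 'attachment',
];
$response = $client->index($param);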
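
The 'pipeline' => 'attachment' entry only works if an ingest pipeline with that name already exists on the cluster. The post does not show how it was created, so the following is a minimal sketch of one way to define it with curl. The URL and the elastic/password credentials are assumptions borrowed from the FSCrawler configuration further down (the PHP code itself authenticates with an API key), and the description text is made up for illustration. Depending on the Elasticsearch version, the attachment processor may first need to be installed as the ingest-attachment plugin.

```shell
#!/bin/sh
# Assumed connection details; override them through the environment.
ES_URL="${ES_URL:-https://127.0.0.1:9200}"
ES_USER="${ES_USER:-elastic}"
ES_PASS="${ES_PASS:-password}"

# Pipeline definition: the attachment processor decodes the base64 string in
# the "content" field and stores the extracted text and metadata under the
# "attachment" field of the indexed document.
PIPELINE_JSON='{
  "description": "Extract text from base64-encoded uploads",
  "processors": [
    { "attachment": { "field": "content" } }
  ]
}'

create_attachment_pipeline() {
  # -k skips TLS certificate verification, like setSSLVerification(false).
  curl -k -u "$ES_USER:$ES_PASS" \
    -X PUT "$ES_URL/_ingest/pipeline/attachment" \
    -H 'Content-Type: application/json' \
    -d "$PIPELINE_JSON"
}

# Contact the cluster only when the script is called with "apply".
if [ "${1:-}" = "apply" ]; then
  create_attachment_pipeline
fi
```

Saved as a script and run with the argument apply, this sends one PUT request. After that, documents indexed through the pipeline should carry the extracted text in attachment.content.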

OCR

Install Tesseract

brew install tesseract tesseract-lang
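
Since the FSCrawler configuration below sets the OCR language to chi_sim, it can be worth checking that this language pack really was installed. A small sketch using tesseract --list-langs follows; the has_lang helper is a hypothetical name introduced only for this example.

```shell
#!/bin/sh
# has_lang LANG: succeeds if LANG appears as a complete line on stdin.
has_lang() {
  grep -qxF -- "$1"
}

if command -v tesseract >/dev/null 2>&1; then
  # Some Tesseract versions print the list on stderr, so merge both streams.
  if tesseract --list-langs 2>&1 | has_lang chi_sim; then
    echo "chi_sim is installed"
  else
    echo "chi_sim is missing"
  fi
else
  echo "tesseract not found on PATH"
fi
```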

Download FSCrawler. Running it with a job name that does not exist yet prompts you to create that job:

bin/fscrawler test01

Then open the job's configuration file (by default ~/.fscrawler/test01/_settings.yaml). An example configuration:

---
name: "test01"
fs:
  url: "/Users/yourname/Desktop/tmp"
  update_rate: "1m"
  includes:
    - "*.pdf"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: false
  lang_detect: false
  continue_on_error: true
  ocr:
    language: "chi_sim"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
    - url: "https://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: false

  username: "elastic"
  password: "password"

Start the job. With --restart the crawl starts over from scratch, and --loop 1 makes it exit after a single pass:

bin/fscrawler test01 --loop 1 --restart
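
Once the run has finished, a quick search can confirm that the OCR text reached Elasticsearch. The sketch below assumes FSCrawler's defaults, where the index is named after the job (test01) and the extracted text is stored in the content field. It reuses the URL and credentials from the configuration above. The build_query and search_job_index helpers and the search term are hypothetical.

```shell
#!/bin/sh
# Connection details taken from the example configuration; adjust as needed.
ES_URL="${ES_URL:-https://127.0.0.1:9200}"
ES_USER="${ES_USER:-elastic}"
ES_PASS="${ES_PASS:-password}"

# build_query TERM: print a match query on the "content" field.
# The term is inserted verbatim, so it must not contain quotes or backslashes.
build_query() {
  printf '{"query":{"match":{"content":"%s"}}}' "$1"
}

# search_job_index INDEX TERM: run the query against the given index.
search_job_index() {
  curl -k -s -u "$ES_USER:$ES_PASS" \
    -H 'Content-Type: application/json' \
    "$ES_URL/$1/_search?pretty" \
    -d "$(build_query "$2")"
}

# Query the cluster only when called as: script run <term>
if [ "${1:-}" = "run" ]; then
  search_job_index test01 "${2:-invoice}"
fi
```

If the hits array in the response is empty, the FSCrawler log is usually the first place to look for files that were skipped or failed OCR.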

