Document upload
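// Belongs at the top of the file; this namespace assumes elasticsearch-php 8.x.
use Elastic\Elasticsearch\ClientBuilder;

// Read the uploaded file and base64-encode it for the attachment pipeline.
$file = $request->file('file');
$content = file_get_contents($file->getRealPath());
$base64content = base64_encode($content);

// Certificate verification is disabled here; only do this against a trusted development cluster.
$client = ClientBuilder::create()
    ->setHosts(config("es.hosts"))
    ->setSSLVerification(false)
    ->setApiKey(config("es.api"))
    ->build();

// 'type' => '_doc' is omitted: mapping types are deprecated since Elasticsearch 7 and removed in 8.
$param = [
    'index' => 'docwrite',
    // Ingest pipeline that extracts text from the base64 "content" field.
    'pipeline' => 'attachment',
    'body' => [
        "category" => $request->input("category") ?? "",
        "name" => $file->getClientOriginalName(),
        "doc_type" => $file->getClientOriginalExtension(),
        "content" => $base64content,
    ],
];
$response = $client->index($param);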
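The upload code relies on an ingest pipeline named attachment already existing in the cluster. The notes do not show how it was created, so the following is only a sketch of one way it might be defined. The helper name attachmentPipelineParams, the description text, and the single-processor layout are assumptions; on older Elasticsearch versions the attachment processor may also need the ingest-attachment plugin installed.

```php
<?php
// Sketch only: builds the parameters for an ingest pipeline named "attachment".
// The helper name and processor settings are assumptions, not the original setup.
function attachmentPipelineParams(string $id = 'attachment', string $field = 'content'): array
{
    return [
        'id' => $id,
        'body' => [
            'description' => 'Extract text from base64-encoded documents',
            'processors' => [
                // The attachment processor decodes $field and, by default,
                // stores the extracted text and metadata under "attachment".
                ['attachment' => ['field' => $field]],
            ],
        ],
    ];
}

$params = attachmentPipelineParams();
echo json_encode($params['body'], JSON_PRETTY_PRINT), PHP_EOL;

// With a client built as in the upload code, the pipeline could be registered once:
// $client->ingest()->putPipeline($params);
```

With this definition the extracted text should end up in attachment.content, while the original base64 payload stays in content unless another processor removes it.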
OCR
Install Tesseract
brew install tesseract tesseract-lang
Download FSCrawler. Running it with a job name that does not exist yet prompts you to create the job:
bin/fscrawler test01
Then open the job's settings file (by default ~/.fscrawler/test01/_settings.yaml). Example configuration:
---
name: "test01"
fs:
  url: "/Users/yourname/Desktop/tmp"
  update_rate: "1m"
  includes:
  - "*.pdf"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: false
  lang_detect: false
  continue_on_error: true
  ocr:
    language: "chi_sim"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "https://127.0.0.1:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
  ssl_verification: false
  username: "elastic"
  password: "password"
Start the job (restart from scratch and exit after a single run):
bin/fscrawler test01 --loop 1 --restart
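Once the run finishes, a quick search is one way to check that the OCR text was indexed. The sketch below only builds the search parameters. The helper name contentSearchParams and the search term are hypothetical, and it assumes FSCrawler wrote to an index named after the job (test01) with the extracted text in a content field; check the actual mapping before relying on either.

```php
<?php
// Sketch only: builds a match query against the field holding extracted text.
// Index and field names are assumptions; adjust them to the actual mapping.
function contentSearchParams(string $index, string $field, string $text, int $size = 5): array
{
    return [
        'index' => $index,
        'body' => [
            'size' => $size,
            // Full-text match on the extracted (OCR or plain) text.
            'query' => ['match' => [$field => $text]],
        ],
    ];
}

// "invoice" is a placeholder search term.
$search = contentSearchParams('test01', 'content', 'invoice');
echo json_encode($search, JSON_PRETTY_PRINT), PHP_EOL;

// With a client built as in the upload code:
// $response = $client->search($search);
```

For documents sent through the upload code instead, the same helper would presumably be called with the docwrite index and the attachment.content field.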