Chat Nio: possible causes of a network error when configuring the custom file parsing service #6

Closed
dqzboy opened this issue May 17, 2024 · 16 comments · Fixed by #13

@dqzboy

dqzboy commented May 17, 2024

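After deploying chatnio-blob-service, uploading an image through its own front end works fine. However, once the service address is configured in the chatnio admin panel, uploading an image fails immediately with an error, and no related output appears in the logs.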
image
image

chatnio-blob-service is deployed with Docker.
chatnio reports a network error whether it is deployed with Docker or built from source.

@dqzboy dqzboy closed this as completed May 18, 2024
@dqzboy dqzboy reopened this May 18, 2024
@zmh-program
Owner

Please provide your browser logs. The possible causes of this issue are:

  1. Mixed Content error: browsers do not allow HTTPS pages to initiate HTTP requests.
  2. CORS issue: this does not occur with the default configuration, but a misconfigured cross-origin resource sharing setup can cause it.
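
Both causes can be checked from the two URLs alone before opening the browser console. The following is a minimal sketch, not part of Chat Nio or chatnio-blob-service: `origin_of` and `diagnose` are hypothetical helper names, and the URLs in the usage line are made-up examples. It ignores the exceptions browsers make for some loopback addresses.

```python
from urllib.parse import urlsplit

# Default ports, so that "https://host" and "https://host:443" compare as equal.
_DEFAULT_PORTS = {"http": 80, "https": 443}


def origin_of(url: str) -> tuple:
    """Return the (scheme, host, port) triple that browsers treat as the origin."""
    parts = urlsplit(url)
    port = parts.port or _DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)


def diagnose(page_url: str, service_url: str) -> list:
    """List which of the two causes could apply to this pair of URLs."""
    page, service = origin_of(page_url), origin_of(service_url)
    findings = []
    # Cause 1: an HTTPS page calling a plain-HTTP service.
    if page[0] == "https" and service[0] == "http":
        findings.append("mixed-content")
    # Cause 2: different origins, so the service must send valid CORS headers.
    if page != service:
        findings.append("cross-origin")
    return findings


if __name__ == "__main__":
    # Hypothetical addresses: an HTTPS admin panel and a LAN blob service.
    print(diagnose("https://chat.example.com/admin", "http://192.168.1.10:8000/upload"))
```

A "cross-origin" finding is not an error by itself; it only means the CORS configuration of the service matters for this pair of addresses.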

@dqzboy
Author

dqzboy commented May 30, 2024

Thanks, it's solved.

@dqzboy dqzboy closed this as completed May 30, 2024
@zmh-program
Owner

marked as pinned issue

@zmh-program zmh-program changed the title from "Network error as soon as this project's address is configured in the chatnio admin panel" to "Chat Nio: possible causes of a network error when configuring the custom file parsing service" May 30, 2024
@zmh-program zmh-program pinned this issue May 30, 2024
@kitaev-chen

> Thanks, it's solved.

??? How did you solve it?

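@zmh-program
Owner

> > Thanks, it's solved.
>
> ??? How did you solve it?

I suggest you share the details of your own setup. As it stands, there is no useful information to go on.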

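@dqzboy
Author

dqzboy commented Dec 3, 2024

> > Thanks, it's solved.
>
> ??? How did you solve it?

It was a cross-origin (CORS) problem. Either configure the service with its public address directly, or put it behind an nginx proxy.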

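@kitaev-chen

kitaev-chen commented Dec 3, 2024

> > > Thanks, it's solved.
> >
> > ??? How did you solve it?
>
> It was a cross-origin (CORS) problem. Either configure the service with its public address directly, or put it behind an nginx proxy.

Thanks. Is there a solution if I only want to experiment on localhost? I merged the blob service into searxng's compose file (with caddy) and set - CORS_ALLOW_ORIGINS=* , but it still doesn't work.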

@zmh-program
Owner

Press F12 and check what error the failing request reports.
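
The check that the browser performs on a cross-origin response can also be reproduced outside the browser, which helps separate a CORS misconfiguration from an address that is simply unreachable. The following is a rough sketch under stated assumptions: `cors_allows` and `fetch_preflight_headers` are hypothetical helpers, not part of chatnio-blob-service, and the sketch only looks at the `Access-Control-Allow-Origin` header (a real preflight also checks allowed methods and headers).

```python
import urllib.request
from typing import Mapping


def cors_allows(response_headers: Mapping[str, str], origin: str,
                with_credentials: bool = False) -> bool:
    """Decide whether Access-Control-Allow-Origin permits the given origin."""
    # Header names are case-insensitive, so normalise them first.
    headers = {k.lower(): v.strip() for k, v in response_headers.items()}
    allowed = headers.get("access-control-allow-origin")
    if allowed is None:
        return False  # no CORS header at all: the browser blocks the response
    if allowed == "*":
        # The wildcard is not honoured for requests that carry credentials.
        return not with_credentials
    return allowed == origin  # otherwise the value must match the origin exactly


def fetch_preflight_headers(url: str, origin: str, method: str = "POST") -> dict:
    """Send an OPTIONS preflight request and return the response headers.

    urlopen raises HTTPError for 4xx/5xx answers and URLError when the host
    cannot be reached; both outcomes are useful diagnostics in their own right.
    """
    request = urllib.request.Request(
        url,
        method="OPTIONS",
        headers={"Origin": origin, "Access-Control-Request-Method": method},
    )
    with urllib.request.urlopen(request, timeout=5) as response:
        return dict(response.headers.items())
```

If `fetch_preflight_headers` cannot connect at all from the machine running the browser, the problem is the configured address rather than CORS.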

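@kitaev-chen

> Press F12 and check what error the failing request reports.

Thanks for the tip, that helped a lot. It was a host.docker.internal kind of problem. I had already switched the compose file to the LAN IP and restarted, but it turned out that refreshing with F5 was not enough: I had to close the tab and open it again.

Small files mostly work now, but a 9-page paper still fails: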
400 Client Error: Bad Request for url: http://192.168.xxx:xxx/ocr/predict-by-file

@zmh-program
Owner

zmh-program commented Dec 3, 2024

This 400 Client Error is probably a bug in paddleocr-api.

@kitaev-chen

Possibly. I wonder whether paddleocr could be swapped for zerox.

@zmh-program
Owner

Please open a new issue for that. PRs are welcome too.

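@kitaev-chen

> This 400 Client Error is probably a bug in paddleocr-api.

Figured it out. The images extracted from my test PDF come in various formats, such as jpeg and jb2. paddleocr-api does not support the former, and blob-service does not recognize the latter as an image. No wonder there were so many problems.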

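@zmh-program
Owner

zmh-program commented Dec 4, 2024

> > This 400 Client Error is probably a bug in paddleocr-api.
>
> Figured it out. The images extracted from my test PDF come in various formats, such as jpeg and jb2. paddleocr-api does not support the former, and blob-service does not recognize the latter as an image. No wonder there were so many problems.

I learned something new. I'll fix the latter.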

@zmh-program zmh-program reopened this Dec 4, 2024
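@zmh-program
Owner

zmh-program commented Dec 4, 2024

No, wait, you got me confused, haha. This problem is probably not caused by the blob service's detector. (I will still add support for these formats, though, because direct uploads are affected.)

I went back and looked at the logic that extracts images from a PDF. Nothing is checked after get_images is called.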

for image_instance in page.get_images(full=True):  # get all images on the page
    cursor += 1
    xref = image_instance[0]  # get the xref of the image
    image = doc.extract_image(xref)  # extract the image
    data = image['image']  # get the image data
    suffix = image.get('ext', '')  # get the image extension
    image_name = f"{filename}_extracted_{cursor}.{suffix}"  # create a name for the image
    io = BytesIO(data)
    io.name = image_name
    io.seek(0)
    # create a file-like object for the image
    image_file = UploadFile(io, filename=image_name)
    stack.append(await process_image(image_file, enable_ocr=enable_ocr, enable_vision=enable_vision, not_raise=True))
    print(f"[pdf] extracted image: {image_name} (page: {page.number}, cursor: {cursor}, max: {PDF_MAX_IMAGES})")
    if PDF_MAX_IMAGES != -1 and cursor >= PDF_MAX_IMAGES:
        break

The process logic implemented in image.py has no is_image check either.

async def process(file: UploadFile, enable_ocr: bool, enable_vision: bool, not_raise: bool = False):
    """Process image."""
    if enable_ocr:
        return create_ocr_task(file)
    if not enable_vision:
        if not not_raise:
            return ""
        raise ValueError("Trying to upload image with Vision disabled.")
    return await process_image(file)

As for where the image suffix checker lives: it appears to exist only in the switch-style dispatch in processor.py. process itself performs no check.

elif image.is_image(filename):
    return "image", await image.process(
        file,
        enable_ocr=enable_ocr,
        enable_vision=enable_vision,
    )
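
The gap described above, where extracted images reach process without any suffix check, could be closed by filtering on the `ext` field before handing images on. The following is a minimal sketch and not the fix that was merged: `split_supported` is a hypothetical helper, the allow-list passed to it is illustrative, and the formats a given OCR backend really accepts have to be verified separately. The input dictionaries mimic the `ext` and `image` keys used in the extraction loop quoted earlier.

```python
from typing import Iterable, List, Set, Tuple


def split_supported(extracted: Iterable[dict],
                    supported: Set[str]) -> Tuple[List[dict], List[dict]]:
    """Partition extracted images into (kept, skipped) by their 'ext' field."""
    kept, skipped = [], []
    for image in extracted:
        # Normalise the extension: missing -> "", upper case -> lower, ".png" -> "png".
        ext = str(image.get("ext", "")).lower().lstrip(".")
        (kept if ext in supported else skipped).append(image)
    return kept, skipped


if __name__ == "__main__":
    # Hypothetical extraction results and a hypothetical allow-list.
    images = [
        {"ext": "png", "image": b"..."},
        {"ext": "jb2", "image": b"..."},
    ]
    kept, skipped = split_supported(images, {"png", "bmp"})
    for image in skipped:
        print(f"[pdf] skipping unsupported image format: {image.get('ext', '')!r}")
```

Skipped images could be logged rather than sent to the OCR service, so that one unsupported image does not fail the whole PDF.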

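@kitaev-chen

Oh, thanks a lot! I only took a quick look myself, but adding more complete format support would be nice. I'm not sure how many formats paddleocr supports; I'll also try making changes on the PaddleOCRFastAPI side later.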
