Set up a default hits value, add dev Dockerfile and config, modify HAL URL construction for both dev (médialab) and full (Sciences Po) dump retrieval
jimenaRL committed Oct 17, 2023
1 parent bbf2edc commit 98709eb
Showing 6 changed files with 60 additions and 4 deletions.
1 change: 1 addition & 0 deletions .gitignore
@@ -1,5 +1,6 @@
*hnswlib.index
*hal-productions.json
*hal-productions_medialab.json

# Byte-compiled / optimized / DLL files
__pycache__/
9 changes: 8 additions & 1 deletion Dockerfile_dev
@@ -84,7 +84,14 @@ RUN apt-get update && \
RUN git clone https://github.com/medialab/halexp.git
WORKDIR /halexp

# download sBert models
RUN python -c "from sentence_transformers import SentenceTransformer; sBert = SentenceTransformer('distiluse-base-multilingual-cased-v1')"

ENV APPCONFIG=/halexp/config.yaml
ENV FLASK_APP=/halexp/python/halexp/app.py

#CMD ["bash", "start.sh"]
# get HAL dump
RUN python get_dump.py --config=config_dev.yaml

# run server
CMD ["flask", "run", "--host=0.0.0.0", "--port=80", "--debugger"]
2 changes: 2 additions & 0 deletions config.yaml
@@ -1,11 +1,13 @@
app:
default_nb_hits: 5
style:
imageWidth: 450
logoUrl: https://medialab.sciencespo.fr/static/logo_medialab_d4a4a5af-92bb-4651-97e7-22272a5a5d3f.png
corpus:
dump_file: hal-productions.json
baseUrl: https://api.archives-ouvertes.fr
portail: sciencespo
query: '*:*'
pagination_count: 10000
fields:
- sciencespoId_s
46 changes: 46 additions & 0 deletions config_dev.yaml
@@ -0,0 +1,46 @@
app:
default_nb_hits: 5
style:
imageWidth: 450
logoUrl: https://medialab.sciencespo.fr/static/logo_medialab_d4a4a5af-92bb-4651-97e7-22272a5a5d3f.png
corpus:
dump_file: hal-productions.json
baseUrl: https://api.archives-ouvertes.fr
portail: index
query: 'labStructId_i:394361'
pagination_count: 10000
fields:
- sciencespoId_s
- halId_s
- uri_s
- docType_s
- language_s
- title_s
- subtitle_s
- abstract_s
- description_s
- en_title_s
- en_subTitle_s
- en_abstract_s
- en_description_s
- fr_title_s
- fr_subTitle_s
- fr_abstract_s
- fr_description_s
- modifiedDate_s
- submittedDate_s
- releasedDate_s
- producedDate_s
- publicationDate_s
- ePublicationDate_s
- conferenceStartDate_s
- conferenceEndDate_s
- writingDate_s
- defenseDate_s
- authFirstName_s
- authLastName_s
- citationFull_s
index:
hnswlib_space: cosine
sentence_transformer_model: distiluse-base-multilingual-cased-v1

4 changes: 2 additions & 2 deletions get_dump.py
@@ -18,11 +18,11 @@

 BASE_URL = params['baseUrl']
 PAGINATION_COUNT = params['pagination_count']
-QUERY = "*:*"
+QUERY = params['query']
 FL_PARAM = '&fl='+','.join(params['fields'])
 PORTAIL = params['portail']

-base_url = f"{BASE_URL}/search/{PORTAIL}/?q{QUERY}"
+base_url = f"{BASE_URL}/search/{PORTAIL}/?q={QUERY}"
 base_url += f"&wt=json&fl={FL_PARAM}"
 base_url += f"&rows={PAGINATION_COUNT}&sort=docid+asc"
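
Review note: the hunk above fixes the missing `=` after `q` and reads the query from the config. As shown, `FL_PARAM` already starts with `&fl=` and is then interpolated after another `&fl=`, so the resulting URL appears to contain `fl=&fl=...`. A minimal sketch of an alternative, assuming the same `corpus` keys as in `config_dev.yaml`, lets the standard library build and escape the query string; the function name `build_search_url` is hypothetical and not part of the repository.

```python
from urllib.parse import urlencode


def build_search_url(params: dict) -> str:
    """Build a HAL search URL from the `corpus` section of the config.

    Hypothetical helper: it mirrors the string concatenation in get_dump.py
    but delegates parameter joining and escaping to urlencode.
    """
    query_string = urlencode({
        "q": params["query"],
        "wt": "json",
        "fl": ",".join(params["fields"]),   # single fl parameter
        "rows": params["pagination_count"],
        "sort": "docid asc",                # encoded as docid+asc
    })
    return f"{params['baseUrl']}/search/{params['portail']}/?{query_string}"


# Values copied from config_dev.yaml, with a shortened field list.
dev_params = {
    "baseUrl": "https://api.archives-ouvertes.fr",
    "portail": "index",
    "query": "labStructId_i:394361",
    "pagination_count": 10000,
    "fields": ["halId_s", "title_s"],
}
print(build_search_url(dev_params))
```

Percent-encoding the `:` and `,` characters is standard URL behavior; whether the HAL API treats the encoded and raw forms identically has not been checked here.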

2 changes: 1 addition & 1 deletion python/halexp/app.py
@@ -75,7 +75,7 @@ def query():
         return {'error': 'Missing `query` argument in query string'}
     nb_hits = request.args.get('hits')
     if nb_hits is None:
-        return {'error': 'Missing `hits` argument in query string'}
+        nb_hits = params['app']['default_nb_hits']

     res = index.retrieve(query=query, top_k=castInt(nb_hits))
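
Review note: the change above makes `hits` optional by falling back to `app.default_nb_hits` from the config. The fallback can be sketched in isolation as below, without Flask; `resolve_nb_hits` is a hypothetical name, a plain dict stands in for `request.args`, and `int` stands in for the repository's `castInt`, whose behavior is not shown in this diff.

```python
def resolve_nb_hits(args: dict, params: dict) -> int:
    """Return the requested number of hits, or the configured default.

    `args` stands in for Flask's request.args (assumption for this sketch);
    `params` is the parsed config with an `app.default_nb_hits` entry.
    """
    nb_hits = args.get("hits")
    if nb_hits is None:
        # No `hits` in the query string: use the configured default.
        nb_hits = params["app"]["default_nb_hits"]
    return int(nb_hits)


config = {"app": {"default_nb_hits": 5}}
print(resolve_nb_hits({}, config))              # falls back to the default
print(resolve_nb_hits({"hits": "12"}, config))  # explicit value wins
```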
