ratansunpy.scrapper package

Submodules

ratansunpy.scrapper.scrapper module

class ratansunpy.scrapper.scrapper.Scrapper(baseurl: str, regex_pattern: str | None = None, condition: Callable[[str, str, str], str] | None = None, filter: Callable[[str], bool] | None = None, **kwargs: Any)[source]

Bases: object

check_date_in_timerange_from_file_date(file_date: str, timerange: TimeRange) bool[source]

Check if a given file date is within the specified time range.

Parameters:
  • file_date – The file date as a string (format: “%Y-%m-%d”).

  • timerange – The TimeRange object representing the time range.

Returns:

True if the date is within the range, False otherwise.
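
The check can be illustrated with a minimal standard-library sketch. It assumes the range is inclusive at both ends and uses two plain datetime objects in place of a TimeRange; date_in_range is a hypothetical helper, not the method's actual implementation.

```python
from datetime import datetime


def date_in_range(file_date: str, start: datetime, end: datetime) -> bool:
    """Return True if file_date ('%Y-%m-%d') lies within [start, end]."""
    # Parse the file date using the format documented above.
    parsed = datetime.strptime(file_date, "%Y-%m-%d")
    # Assumption: both bounds are inclusive.
    return start <= parsed <= end


start = datetime(2021, 10, 1)
end = datetime(2021, 10, 31)
print(date_in_range("2021-10-12", start, end))
print(date_in_range("2021-11-01", start, end))
```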

check_date_in_timerange_from_url(url: str, timerange: TimeRange) bool[source]

Check if the date extracted from a URL is within the given time range.

Parameters:
  • url – The URL string.

  • timerange – The TimeRange object representing the time range.

Returns:

True if the date is within the range, False otherwise.

extract_date_from_url(url)[source]

Extract date from a given URL based on the base URL’s pattern.

Parameters:

url – The URL string.

Returns:

The extracted Time object.
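
One way to recover a date from a URL is to turn the strftime-style base pattern into a regular expression and read the captured fields. The sketch below is only an illustration under that assumption: extract_date and DIRECTIVE_REGEX are hypothetical names, only a few directives are supported, and it returns a standard datetime rather than a Time object.

```python
import re
from datetime import datetime

# Regex fragment for each supported strftime directive (a small subset).
DIRECTIVE_REGEX = {
    "%Y": r"(\d{4})",
    "%m": r"(\d{2})",
    "%d": r"(\d{2})",
    "%H": r"(\d{2})",
    "%M": r"(\d{2})",
    "%S": r"(\d{2})",
}


def extract_date(url: str, pattern: str) -> datetime:
    """Parse the date embedded in url according to a strftime-like pattern."""
    # Split into literal text and directives, keeping the directives.
    tokens = re.split(r"(%[YmdHMS])", pattern)
    directives = [t for t in tokens if t in DIRECTIVE_REGEX]
    # Directives become capture groups; literal text is escaped.
    regex = "".join(DIRECTIVE_REGEX.get(t, re.escape(t)) for t in tokens)
    match = re.fullmatch(regex, url)
    if match is None:
        raise ValueError(f"URL does not match pattern: {url}")
    # A directive may repeat (e.g. %Y in both directory and file name);
    # the last captured value is kept for each directive.
    fields = {}
    for directive, value in zip(directives, match.groups()):
        fields[directive] = value
    return datetime.strptime(" ".join(fields.values()), " ".join(fields))


pattern = ("ftp://ftp.ngdc.noaa.gov/STP/swpc_products/daily_reports/"
           "solar_region_summaries/%Y/%m/%Y%m%dSRS.txt")
url = ("ftp://ftp.ngdc.noaa.gov/STP/swpc_products/daily_reports/"
       "solar_region_summaries/2021/10/20211012SRS.txt")
print(extract_date(url, pattern))
```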

static floor_datetime(date: Time, timestep: relativedelta) datetime[source]

Floor the given datetime to the nearest significant time unit.

Parameters:
  • date – The Time object to floor.

  • timestep – The relativedelta object representing the smallest significant time unit.

Returns:

The floored datetime object.
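
Flooring amounts to resetting every field finer than the chosen unit. The following sketch shows the idea with the standard library only. It takes a unit name instead of a relativedelta, and floor_to_unit is a hypothetical helper rather than the method documented here.

```python
from datetime import datetime

# Fields from coarsest to finest, and the value each is reset to when floored.
FIELD_ORDER = ["year", "month", "day", "hour", "minute", "second"]
FIELD_DEFAULTS = {"month": 1, "day": 1, "hour": 0, "minute": 0, "second": 0}


def floor_to_unit(date: datetime, unit: str) -> datetime:
    """Reset every field finer than unit to its smallest value."""
    if unit not in FIELD_ORDER:
        raise ValueError(f"Unsupported unit: {unit}")
    finer = FIELD_ORDER[FIELD_ORDER.index(unit) + 1:]
    reset = {name: FIELD_DEFAULTS[name] for name in finer}
    return date.replace(microsecond=0, **reset)


moment = datetime(2021, 10, 12, 14, 35, 59)
print(floor_to_unit(moment, "day"))    # 2021-10-12 00:00:00
print(floor_to_unit(moment, "month"))  # 2021-10-01 00:00:00
```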

form_fileslist(timerange: TimeRange) List[str][source]

Retrieve a list of files from an HTTP or FTP server within the specified time range.

Parameters:

timerange (TimeRange) – The TimeRange object representing the time range.

Returns:

A list of file URLs.

Example:

Usage example based on the SWPC Solar Region Summary (FTP server):

>>> base_url_SRS = r'ftp://ftp.ngdc.noaa.gov/STP/swpc_products/daily_reports/solar_region_summaries/%Y/%m/%Y%m%dSRS.txt'
>>> scraper = Scrapper(base_url_SRS)
>>> t = TimeRange('2021-10-12', '2021-10-12')
>>> print(t)
(2021-10-12 00:00:00, 2021-10-12 00:00:00)
>>> for url in scraper.form_fileslist(t):
...     print(f'SRS url: {url}')
SRS url: ftp://ftp.ngdc.noaa.gov/STP/swpc_products/daily_reports/solar_region_summaries/2021/10/20211012SRS.txt

Example:

Usage example based on RATAN (HTTP server):

>>> def build_date(year, month, date_match):
...     # File names before May 2010 carry a two-digit year, so prepend the century.
...     if int(year) < 2010 or (int(year) == 2010 and int(month) < 5):
...         return f'{year[:2]}{date_match[:-4]}-{date_match[-4:-2]}-{date_match[-2:]}'
...     return f'{date_match[:-4]}-{date_match[-4:-2]}-{date_match[-2:]}'
>>> base_url_RATAN = 'http://spbf.sao.ru/data/ratan/%Y/%m/%Y%m%d_%H%M%S_sun+0_out.fits'
>>> regex_pattern_RATAN = r'((\d{6,8})[^0-9].*[^0-9]0_out.fits)'
>>> scraper = Scrapper(base_url_RATAN, regex_pattern=regex_pattern_RATAN, condition=build_date)
>>> t = TimeRange('2010-01-13', '2010-01-13')
>>> for url in scraper.form_fileslist(t):
...     print(f'RATAN url: {url}')
RATAN url: http://spbf.sao.ru/data/ratan/2010/01/100113sun0_out.fits

ftpfiles(timerange: TimeRange) List[str][source]

Retrieve a list of files from an FTP server within the specified time range.

Parameters:

timerange – The TimeRange object representing the time range.

Returns:

A list of file URLs.

httpfiles(timerange: TimeRange) List[str][source]

Retrieve a list of files from an HTTP server within the specified time range.

Parameters:

timerange – The TimeRange object representing the time range.

Returns:

A list of file URLs.

range(timerange: TimeRange) List[str][source]

Generate a list of directories within the time range based on the smallest significant pattern.

Parameters:

timerange – The TimeRange object representing the time range.

Returns:

A list of directory paths.
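
As a rough illustration of directory enumeration, the sketch below lists one directory per month for a base pattern whose smallest unit is the month. The helper name month_directories and the use of plain datetime bounds are assumptions made for the example. The actual method selects the step from the pattern itself.

```python
from datetime import datetime
from typing import List


def month_directories(base: str, start: datetime, end: datetime) -> List[str]:
    """List monthly directories (strftime pattern base) covering [start, end]."""
    directories = []
    year, month = start.year, start.month
    # Step month by month until the end month has been emitted.
    while (year, month) <= (end.year, end.month):
        directories.append(datetime(year, month, 1).strftime(base))
        month += 1
        if month > 12:
            year, month = year + 1, 1
    return directories


base = "http://spbf.sao.ru/data/ratan/%Y/%m/"
for directory in month_directories(base, datetime(2021, 11, 20), datetime(2022, 1, 5)):
    print(directory)
```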

static smallest_significant_pattern(pattern: str) relativedelta | None[source]

Determine the smallest significant time unit (e.g., seconds, minutes, days) present in the given pattern. Some of the relevant time formats are described at https://fits.gsfc.nasa.gov/iso-time.html

Parameters:

pattern – The pattern string.

Returns:

The smallest significant relativedelta object, or None if not found.
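
The idea can be sketched by scanning the pattern for strftime directives from finest to coarsest. This simplified version returns a unit name instead of a relativedelta and recognises only six directives; smallest_unit and UNITS_FINEST_FIRST are hypothetical names used for illustration.

```python
from typing import Optional

# Directives ordered from the finest time unit to the coarsest.
UNITS_FINEST_FIRST = [
    ("%S", "second"),
    ("%M", "minute"),
    ("%H", "hour"),
    ("%d", "day"),
    ("%m", "month"),
    ("%Y", "year"),
]


def smallest_unit(pattern: str) -> Optional[str]:
    """Return the finest time unit named in pattern, or None if there is none."""
    for directive, unit in UNITS_FINEST_FIRST:
        if directive in pattern:
            return unit
    return None


print(smallest_unit("%Y/%m/%Y%m%dSRS.txt"))              # day
print(smallest_unit("%Y/%m/%Y%m%d_%H%M%S_sun+0_out.fits"))  # second
print(smallest_unit("static/readme.txt"))                # None
```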

valid_date_from_url(url: str) bool[source]

Validate if a given URL’s date matches the expected pattern from the base URL.

Parameters:

url – The URL string to validate.

Returns:

True if the URL’s date matches the pattern, False otherwise.
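
A simple approximation of this validation is to parse the file name of the URL with the file-name part of the base pattern and treat a parsing failure as a mismatch. The sketch assumes that comparing file names alone is sufficient and that no directive repeats within the file name; file_name_matches is a hypothetical helper, and datetime.strptime is more lenient than a strict fixed-width check (it accepts single-digit months and days).

```python
from datetime import datetime


def file_name_matches(url: str, pattern: str) -> bool:
    """Return True if the file name in url parses with the pattern's file name."""
    file_name = url.rsplit("/", 1)[-1]
    file_pattern = pattern.rsplit("/", 1)[-1]
    try:
        datetime.strptime(file_name, file_pattern)
    except ValueError:
        # The name has a different layout or encodes an impossible date.
        return False
    return True


pattern = ("ftp://ftp.ngdc.noaa.gov/STP/swpc_products/daily_reports/"
           "solar_region_summaries/%Y/%m/%Y%m%dSRS.txt")
good = ("ftp://ftp.ngdc.noaa.gov/STP/swpc_products/daily_reports/"
        "solar_region_summaries/2021/10/20211012SRS.txt")
print(file_name_matches(good, pattern))
print(file_name_matches(good.replace("20211012SRS", "README"), pattern))
```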

Module contents