Tutorial
The Three Modules
undated consists of three main modules, undated
, undated.fmts
and undated.utils
.
These are distinctly separated for performance, to limit import overhead.
undated
The undated module, is the main module for manipulating dates, it consists of a main class object for managing a date as an integer and several functions for further functionality. All of these are hopefully self-explanatory.
undated.fmts
The fmts
(formattings) module comes into play when the date data format is unknown.
It has two key class objects. The Deriver
class derives the date format from a list
of dates, returning an UndatedFormat
class object, which contains the information
required by the as_parts
function to extract the date elements.
undated.utils
The utils module is tuned for performance. It has little to no validation on the parameters
and works solely with dates stored as integers in the Ymd
format
Important
The tutorial assumes that the undated, undated.fmts, and undated.utils packages have been imported as ud, udf and udu respectfully.
import undated as ud
import undated.fmts as udf
import undated.utils as udu
Limitations
undated
was designed with data processing of recent (-ish) dates in mind. Because of this it uses the Gregorian calendar
and makes no adjustment for the Julian calendar changeover. Thus, it will treat any dates before 1583 as invalid.
Common Parameter Names
Parameter names have been standardised as much as possible, to be more consistent, intuitive and understandable.
day: the day of the month as an integer
fmt: the date format string or can sometimes be the udf.UndatedFormat object
idate: a date integer of no specific format
iymd: an integer in the year month day format
month: the month as an integer, January == 1
sdate: a date string or integer of no specific format
year: the year as an integer
ymd: a ud.YMD class object
yy_pivot: for processing two-digit years, the lower bound for converting two-digit years to four-digit years
undated YMD Class
The class ud.YMD
is the go-to tool, when manipulation of the dates is required during processing.
import undated as ud
ymd = ud.YMD(2022_07_04)
print('ymd =', ymd)
print('plus 7 days =', ymd + 7)
print('minus 7 days =', ymd - 7)
print('add 2 months =', ymd.add_months(2))
print('minus 2 months =', ymd.add_months(-2))
print('add 10 weekdays =', ymd.add_weekdays(10))
print('minus 10 weekdays =', ymd.add_weekdays(-10))
print('add 3 years =', ymd.add_years(3))
print('minus 3 years =', ymd.add_years(-3))
print('day of the week =', ymd.day_of_week())
print('is leap year =', ymd.is_leap_year())
print('is weekday =', ymd.is_weekday())
print('the day, month and year =', ymd.day, ymd.month, ymd.year)
gives the results
ymd = 20220704
plus 7 days = 20220711
minus 7 days = 20220627
add 2 months = 20220904
minus 2 months = 20220504
add 10 weekdays = 20220718
minus 10 weekdays = 20220620
add 3 years = 20250704
minus 3 years = 20190704
day of the week = 1
is leap year = False
is weekday = True
the day, month and year = 4 7 2022
undated Functions
The add_days
, add_months
and add_years
functions offer an alternative approach to the same functionality as the YMD
class methods.
With the YMD class being passed as a parameter.
import undated as ud
ymd = ud.YMD(2022_07_04)
print('ymd =', ymd)
print('plus 7 days =', ud.add_days(ymd, 7))
print('minus 7 days =', ud.add_days(ymd, -7))
print('add 2 months =', ud.add_months(ymd, 2))
print('minus 2 months =', ud.add_months(ymd, -2))
print('add 10 weekdays =', ud.add_weekdays(ymd, 10))
print('minus 10 weekdays =', ud.add_weekdays(ymd, -10))
gives the results
ymd = 20220704
plus 7 days = 20220711
minus 7 days = 20220627
add 2 months = 20220904
minus 2 months = 20220504
add 10 weekdays = 20220718
minus 10 weekdays = 20220620
The undated
module also contains several between functions, that accept two YMD class objects.
These calculate the days between two dates, the complete months between two dates or the weekdays, Monday to Friday, between two dates.
import undated as ud
ymd1 = ud.YMD(2022_07_04)
ymd2 = ud.YMD(2024_05_30)
print('ymd1 ymd2 =', ymd1, ymd2)
print('days between =', ud.days_between(ymd1, ymd2))
print('months between =', ud.months_between(ymd1, ymd2))
print('weekdays between =', ud.weekdays_between(ymd1, ymd2))
gives the results
ymd1 ymd2 = 20220704 20240530
days between = 696
months between = 22
weekdays between = 498
Format Deriver
The udf.Deriver
class analyses the date data and tries to derive the format.
It’s designed with large amounts of data in mind, coming from various sources.
It loops through the data until it finds a date that can be of only one format.
Note
The deriver has been designed to solve the problem where different data sources provide dates in different formats. The deriver assumes that all dates from the same data source, IE those passed to its search method, are all in the same format.
The following examples uses the tutorial.csv file on GitHub. Each date column contains dates in different formats, to represent the different data files being received.
In this example, the deriver is passed a column of date data, date1
in this case, to derive.
import csv
import undated as ud
import undated.fmts as udf
with open('C:/Git/undated/docs/tutorial.csv', newline='') as csvfile:
data = list(csv.DictReader(csvfile))
dates = [row['date1'] for row in data] # Get the required dates into a list
deriver = udf.Deriver() # Initiate the deriver class
deriver.set_parameters({udf.LANGUAGES: 'EN'}) # Set language to English, only required for date3
fmt = deriver.search(dates) # Search for the date format
if fmt:
for ymd in [ud.YMD(udf.as_parts(date, fmt)) for date in dates]:
print(ymd) # The date is now an integer in Ymd format
else:
print('Format not derived')
which gives the results
20200204
20210525
20220831
20080423
20060502
20200229
None
20211120
20201013
20210104
Changing the date1
column to date2
or date3
will give the same output, as udf.Deriver
will evaluate the correct date format.
Tip
Only pass enough dates to the search to be sure of getting a match. If there’re thousands of rows, a few dozen may be enough to determine the format, and there’s always the option of further searches.
Further Date Formats
The Deriver
will try and derive the format from most common date presentations.
The code below is definitely not how the package has been designed to be used but it does show the various date formats that can be accepted.
import undated as ud
import undated.fmts as udf
def go(lists_of_dates):
for dates in lists_of_dates:
fmt = udf.Deriver().search(dates)
for sdate in dates:
ymd_parts = udf.as_parts(sdate, fmt)
print(ud.YMD(ymd_parts) if ymd_parts else 'Error', sdate, sep=' <- ')
go((
('20-mar-20', '21-apr-20', '22-may-20'),
('20mar20', '21apr20', '22may20'),
('11/25/2020 7:00PM Europe/Berlin',),
('25.11.2020 7:00PM Europe/Berlin',),
('Monday, 24 May 2021 05:50', 'Monday, 27 June 2021 05:50'),
('Mon, 25 Jan 2021 05:50:06 GMT', 'Mon, 27 Dec 2021 05:50:06 GMT'),
('Mon, 25 Jan 2021 05:50:06 GMT', 'Mon, 27 Dec 2021 05:50:06 GMT'),
('Mon, 25 Ene 2021 05:50:06 CET', 'Mon, 27 Dic 2021 05:50:06 CET'),
('12092022', '13092022'),
('2021-03-27T05:50:06.7199222-04:00',),
('03/28/2021 05:50:06',),
('29MAR2020', '01JAN2020'),
('Monday, 29 March 2021',),
('Monday, 29 March 2021 05:50 AM',),
('Monday, 29 March 2021 05:50:06',),
))
gives the results
20200320 <- 20-mar-20
20200421 <- 21-apr-20
20200522 <- 22-may-20
20200320 <- 20mar20
20200421 <- 21apr20
20200522 <- 22may20
20201125 <- 11/25/2020 7:00PM Europe/Berlin
20201125 <- 25.11.2020 7:00PM Europe/Berlin
20210529 <- Monday, 29 May 2021 05:50
20210629 <- Monday, 29 June 2021 05:50
20210129 <- Mon, 29 Jan 2021 05:50:06 GMT
20211229 <- Mon, 29 Dec 2021 05:50:06 GMT
20210129 <- Mon, 29 Jan 2021 05:50:06 GMT
20211229 <- Mon, 29 Dec 2021 05:50:06 GMT
20210129 <- Mon, 29 Ene 2021 05:50:06 CET
20211229 <- Mon, 29 Dic 2021 05:50:06 CET
20220912 <- 12092022
20220913 <- 13092022
20210327 <- 2021-03-27T05:50:06.7199222-04:00
20210328 <- 03/28/2021 05:50:06
20200329 <- 29MAR2020
20200101 <- 01JAN2020
20210329 <- Monday, 29 March 2021
20210329 <- Monday, 29 March 2021 05:50 AM
20210329 <- Monday, 29 March 2021 05:50:06
Month Languages
The observant may have spotted some Spanish months in the last example.
The Deriver
currently caters for English, French, German and Spanish, full and abbreviated months names.
If you know the language being used, setting it using the set_parameters
method can improve performance.
Which leads us on to…
Deriver set_parameters
To improve performance and assist with the format deriving process, the Deriver
class object can have parameters set.
Hints
Hints help the Deriver
, especially when there are fewer dates to use to derive the format.
Current hints are:
udf.Y2
the year is two-digitsudf.YFIRST
the year is in the first positionudf.YLAST
the year is in the last positionudf.YM
the date only includes the year and month
The following code applies the hints for two-digit years, and the year in the last position.
import undated.fmts as udf
my_date = '200122'
deriver = udf.Deriver()
deriver.set_parameters({udf.HINTS: [udf.Y2, udf.YLAST]})
fmt = deriver.search(my_date)
print(udf.as_parts(my_date, fmt))
gives the result
(2022, 1, 20)
Languages
If dates have text based months, the language can be set if it is known. This will improve performance and accuracy.
import undated.fmts as udf
my_date = '20-JAN-2022'
deriver = udf.Deriver()
deriver.set_parameters({udf.LANGUAGES: 'EN1'})
fmt = deriver.search(my_date)
print(udf.as_parts(my_date, fmt))
gives the result
(2022, 1, 20)
In the above case, the format would not be derivable without specifying the language, as JAN could be English or German.
The language parameter above is EN1
. The EN
refers to the language,
other valid options are DE
German, ES
Spanish and FR
French.
The 1
indicates that we are using the abbreviated months, 2
being for full month names.
For example, ES2
would be full Spanish month names, FR1
would be abbreviated French months.
Time Separator
Often date strings include the time, which is out of scope for the undated
package, so this needs to be removed.
The default time separator character is T
, following ISO standards. Space is also used.
If dates have another separator character, this can be specified. In this example the @
symbol has been used.
import undated.fmts as udf
my_date = '20-02-2022@12:55:55'
deriver = udf.Deriver()
deriver.set_parameters({udf.TIME_SEPARATOR: '@'})
fmt = deriver.search(my_date)
print(udf.as_parts(my_date, fmt))
gives the result
(2022, 2, 20)
YY Pivot
The udf.YY_PIVOT
property is used to determine how the century is applied to two-digit years.
By default, the undated
pivot year is the current year minus 80.
The default Excel pivot year is set as 40.
The value should be a four-digit year. 1940
would mean any two-digit year 40 or over would be given the century 19
.
Any two-digit year 39
and under will be given the century 20
.
import undated.fmts as udf
my_date = '20-01-35'
# Set the first deriver to 1940
deriver1 = udf.Deriver()
deriver1.set_parameters({udf.HINTS: [udf.Y2], udf.YY_PIVOT: 1940})
fmt1 = deriver1.search(my_date)
print(udf.as_parts(my_date, fmt1))
# Now try again with the pivot at 1930
deriver2 = udf.Deriver()
deriver1.set_parameters({udf.HINTS: [udf.Y2], udf.YY_PIVOT: 1930})
fmt2 = deriver1.search(my_date)
print(udf.as_parts(my_date, fmt2))
gives the result
(2035, 1, 20)
(1935, 1, 20)
String Formats
The UndatedFormat
object is not designed to be created manually. So, if the date format is simple and known, string-based formats can be used.
These use the letters YyMmd
along with the -
character to indicate a separator.
Y: 4-digit year
y: 2-digit year
M: the month as a string
m: the month as an integer
d: the day as an integer
-: separator character, indicates a space, comma, dot, slash or dash
import undated.fmts as udf
print(udf.as_parts(2021_06_12, fmt='Ymd'))
print(udf.as_parts('2021JUN12', fmt='YMd'))
print(udf.as_parts('21JUN12', fmt='yMd'))
print(udf.as_parts('12/JUN/2021', fmt='d-M-Y'))
print(udf.as_parts('12-JUN-21', fmt='d-M-y'))
print(udf.as_parts('12.06.21', fmt='d-m-y'))
gives the result (for each print)
(2021, 6, 12)
To go the next step and convert to ud.YMD
or datetime
import datetime
import undated as ud
import undated.fmts as udf
parts = udf.as_parts('12-JUN-21', fmt='d-M-y')
print(ud.YMD(parts))
print(datetime.datetime(*parts))
gives the result
20210612
2021-06-12 00:00:00