Fluency Processing Language Guide
Fluency Processing Language (FPL) - River Analytics (Ra) is a programming language designed to analyze and manipulate big data records.
function example()
search {from="-8h"} sContent("@tags","fpl-example-data")
let {id, isx5, isprime, odd, even, divisors} = f("@fields")
aggregate v=values(id) by divisors
let num_of_ints = listcount(v)
end
stream demo_table=example()
Basics
In teaching Ra, we will focus first on the search and analysis capabilities. This differs only slightly from a database query.
In a database query, there are SELECT, FROM, WHERE and GROUP BY clauses. An analytic expression has a timeframe (SEARCH), WHERE, ASSIGN, AGGREGATE and BY. Just as in an SQL query, not all the keywords have to appear.
A simple analytic expression can consist of just three steps:
Timeframe
Assignment of variables (columns)
Aggregation
For example, this is a search for the last seven (7) days where we count the number of events for each source.
Timeframe: The ‘search’ command determines the range and source of the search. When no source is given, the source is the main event table by default.
Assignment: We then assign the variable ‘source’ to the field @source by using the field function (f).
Aggregation: An aggregation is a function of the dataset for the provided timeframe. Here the function is simply count(). The variable ‘total’ holds the resulting count for each source.
search { from="-7d<d", to=">d" }
assign source=f("@source")
aggregate total=count() by source
The output of an expression is one or more tables, and these tables are what generate a visualization.
The rest of this introduction will walk through this example and explain the wording of the language. A language’s wording is purposely chosen to allow the programmer to understand how the system will process the code.
Search defines the Timeframe
The selection of data is based on an expression that searches the entire dataset. When doing data analytics, the primary selection is to first define the timeframe. In a data lake this is defined with a start and stop time, while in streaming data it is defined by a sliding window of time. Regardless, the first statement in an analytic expression is the timeframe.
search { from="-7d<d", to=">d" }
The first line in the program is the ‘search’ command. The first difference you might notice is that the search uses curly brackets {}, not parentheses (). When curly brackets are used, the options appear as key-value pairs and are intended for use by the command itself. When parentheses are used, a list of assignments appears. Parentheses are used to pass assignments to another function or block.
The two options that appear for the search command are from and to. They tell the process to select data starting at one point in time and ending at another.
The nomenclature for time is logical:
◦ m: minutes
◦ h: hours
◦ d: days
◦ w: weeks
◦ Mn: months
The greater-than and less-than signs can be seen as arrows on an X-axis. A greater-than sign > points to the end of the time period, while a less-than sign < points to the beginning. So,
◦ >d: means the end of the day
◦ <d: the start of the day
◦ >w: end of the week
◦ <w: start of the week
◦ >m: end of the minute.
A relative adjustment is normally placed in front of the anchor in the ‘from’ value. In this example, we want the from to be the beginning of the day, seven days prior. This is written as:
◦ -7d<d
The expression ‘-7d<d’ is not the same as ‘<w’. The latter is not seven days prior, but the start of the week. In US time, that is Sunday at 12 a.m. (midnight). This makes more sense when you want to know the data for this month:
◦ { from="<Mn", to=">Mn" }
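The time notation above can be modeled outside of Ra. The following is a minimal Python sketch, not the Ra implementation: the function names resolve and _start_of are hypothetical, months (Mn) are left out, the relative offset is assumed to be applied before the anchor, and "end of a period" is assumed to mean the start of the next period.

```python
import re
from datetime import datetime, timedelta

# Fixed-length offsets only; "Mn" (month) is omitted from this sketch.
_OFFSET = {"m": timedelta(minutes=1), "h": timedelta(hours=1),
           "d": timedelta(days=1), "w": timedelta(weeks=1)}

def _start_of(unit, t):
    """Truncate t to the start of its minute, hour, day, or week."""
    if unit == "m":
        return t.replace(second=0, microsecond=0)
    if unit == "h":
        return t.replace(minute=0, second=0, microsecond=0)
    day = t.replace(hour=0, minute=0, second=0, microsecond=0)
    if unit == "d":
        return day
    # unit == "w": the week starts on Sunday (US convention).
    return day - timedelta(days=(day.weekday() + 1) % 7)

def resolve(expr, now):
    """Resolve strings such as "-7d<d" or ">d" relative to `now`."""
    m = re.fullmatch(r"(?:-(\d+)(m|h|d|w))?(?:([<>])(m|h|d|w))?", expr)
    if m is None:
        raise ValueError(f"unsupported time expression: {expr!r}")
    count, unit, arrow, snap = m.groups()
    t = now
    if count:  # assumption: the relative offset is applied first
        t -= int(count) * _OFFSET[unit]
    if arrow:
        t = _start_of(snap, t)
        if arrow == ">":
            # Assumption: "end of period" is the start of the next period.
            t += _OFFSET[snap]
    return t

now = datetime(2024, 3, 13, 15, 30)  # a Wednesday afternoon
print(resolve("-7d<d", now))  # 2024-03-06 00:00:00
print(resolve(">d", now))     # 2024-03-14 00:00:00
print(resolve("<w", now))     # 2024-03-10 00:00:00 (the preceding Sunday)
```

The last two lines show why ‘-7d<d’ and ‘<w’ differ: the first always reaches back a full seven days, while the second lands on whichever Sunday started the current week.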
Assignment
The ‘assign’ statement assigns the expression on the right to the variable on the left. It is called an assignment because the variable name references the column and is used as a handler for it. The value of the column is calculated only once.
assign source=f("@source")
In this example, the field called ‘@source’ is assigned to the variable handler ‘source’.
The f() function refers to the field. This is the most common function in the Ra programming language. Notice that f() has parentheses: it is a function, not a command. It returns an object for that field. In this case, that object is a String called @source.
It could also have returned a JSON object, in which case the left side of the expression can have multiple assignments. We will cover that later.
What is important to understand is that an ‘assign’ is a mapping from the record to a variable handler that represents a column.
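As a rough illustration of this mapping, the Python sketch below models an assignment as reading one field from every record to build one column. It is a model under assumptions, not Ra itself: the records are made up, and this f is a hypothetical stand-in that takes the record explicitly.

```python
# Hypothetical records; in Ra these would come from the search step.
records = [
    {"@source": "firewall", "@message": "deny tcp"},
    {"@source": "dns", "@message": "query example.com"},
    {"@source": "firewall", "@message": "allow udp"},
]

def f(record, field):
    """Stand-in for f(): look up one field on one record."""
    return record[field]

# Model of: assign source=f("@source")
# The column is computed once, one value per record.
source = [f(r, "@source") for r in records]
print(source)  # ['firewall', 'dns', 'firewall']
```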
Aggregate-By
This is the main difference between a query expression and an analytic. An aggregate is a function that performs a calculation over the dataset. An assignment is a relationship of the record to a variable, while an aggregate is a calculation over the dataset, grouped by the set defined in the ‘by’ command. In this case, it is ‘by’ the ‘source’.
aggregate total=count() by source
Think of this like the GROUP BY in an SQL query. The dataset is divided into groups that share the value in ‘by’. Then the function counts the number of records in each group. Other aggregate functions could have been used instead:
◦ unique(): count the number of unique values in the set.
◦ max(): provide the largest value in the set.
◦ min(): provide the smallest value in the set.
◦ values(): create a list of unique values in the set.
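To make the grouping concrete, here is a small Python sketch of what an aggregate with ‘by’ computes. It only models the behavior described above: the rows and the byte sizes are invented, and the ordering of the value list is an assumption.

```python
from collections import defaultdict

# Hypothetical (source, size) rows, as produced by the assignment step.
rows = [("firewall", 120), ("dns", 40), ("firewall", 300),
        ("dns", 40), ("firewall", 120)]

# Split the dataset into groups that share the same 'by' value.
groups = defaultdict(list)
for source, size in rows:
    groups[source].append(size)

# Apply each aggregate function to every group.
table = {
    source: {
        "count": len(v),              # number of records in the group
        "unique": len(set(v)),        # number of distinct values
        "max": max(v),
        "min": min(v),
        "value_list": sorted(set(v)), # distinct values; sorting is an assumption
    }
    for source, v in groups.items()
}
print(table["firewall"])
# {'count': 3, 'unique': 2, 'max': 300, 'min': 120, 'value_list': [120, 300]}
```

Each key of the resulting dictionary corresponds to one row of the output table, and each inner key to one aggregate column.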
Resulting Table
Each process has at least one table output. The default number of rows is set to ten (10). This can be changed with the sort command:
search { from="-7d<d", to=">d" }
assign source=f("@source")
aggregate total=count() by source
sort 5 total
This generates a table and graph with five (5) rows. The sort command takes a value (the maximum number of rows) followed by the variable (column name) to sort by.
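The effect of the sort step can be sketched in Python as an ordering followed by a row limit. This is a model with made-up totals, and the helper name sort_limit is hypothetical; the text above does not state the sort direction, so largest-first is an assumption here.

```python
# Hypothetical output of: aggregate total=count() by source
totals = {"firewall": 42, "dns": 17, "proxy": 88,
          "vpn": 5, "mail": 23, "web": 61}

def sort_limit(table, limit):
    """Order rows by total (assumed largest first) and keep at most `limit` rows."""
    ordered = sorted(table.items(), key=lambda row: row[1], reverse=True)
    return ordered[:limit]

# Model of: sort 5 total
print(sort_limit(totals, 5))
# [('proxy', 88), ('web', 61), ('firewall', 42), ('mail', 23), ('dns', 17)]
```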
Summary
Ra is a functional programming language that is designed to work on big data. The basic analytic expression is a timeframe, assignment and aggregate.
• The timeframe is defined by a from and a to value. These are defined relative to time periods of minute, hour, day, week, and month.
• The assignment command defines the columns and labels them with the variable name.
• The aggregate variables are the results of a function applied to the dataset in the timeframe.
• A sort command determines the order of listing and number of maximum rows.