From bf8ae7a6be0a754b8d5c2a9f14d552b1697e4936 Mon Sep 17 00:00:00 2001
From: Nathan Ford
Date: Thu, 25 May 2017 11:52:12 -0700
Subject: [PATCH] added string comparison functions

added string comparison functions section in documentation
comparison_with_r.rst
---
 doc/source/comparison_with_r.rst | 97 ++++++++++++++++++++++++++++++++
 1 file changed, 97 insertions(+)

diff --git a/doc/source/comparison_with_r.rst b/doc/source/comparison_with_r.rst
index 194e022e34c7c..b57eab050823b 100644
--- a/doc/source/comparison_with_r.rst
+++ b/doc/source/comparison_with_r.rst
@@ -530,6 +530,103 @@
 For more details and examples see :ref:`categorical introduction `
 :ref:`differences to R's factor `.
 
+String Processing
+-----------------
+
+Length
+~~~~~~
+
+R determines the length of a character string with the ``nchar`` function.
+``nchar`` includes leading and trailing blanks. Use ``nchar`` and ``trimws``
+to exclude leading and trailing blanks.
+
+.. code-block:: none
+
+   df <- data.frame(color = c('red', ' blue', 'green ', ' yellow '))
+   nchar(as.character(df$color))
+   nchar(trimws(as.character(df$color)))
+
+Python determines the length of a character string with the ``len`` function.
+``len`` includes leading and trailing blanks. Use ``len`` and ``strip``
+to exclude leading and trailing blanks.
+
+.. code-block:: none
+
+   df = pd.DataFrame({'color': ['red', ' blue', 'green ', ' yellow ']})
+   df['color'].str.len()
+   df['color'].str.strip().str.len()
+
+
+Find Position
+~~~~~~~~~~~~~
+
+R determines the position of a character in a string with the
+``regexpr`` function. ``regexpr`` takes the string defined by
+the first argument and searches for the first position of the substring
+you supply as the second argument.
+
+.. code-block:: none
+
+   df <- data.frame(sex = c('MALE', 'FEMALE'))
+   pos = regexpr("ALE", df$sex)
+   pos[1:2]
+
+Python determines the position of a character in a string with the
+``find`` function. ``find`` searches for the first position of the
+substring. If the substring is found, the function returns its
+position. Keep in mind that Python indexes are zero based whereas
+R indexes are 1 based.
+
+.. code-block:: none
+
+   df = pd.DataFrame({'sex': ['MALE', 'FEMALE']})
+   df['sex'].str.find("ALE")
+
+Substring
+~~~~~~~~~
+
+R extracts a substring from a string based on its position
+with the ``substr`` function.
+
+.. code-block:: none
+
+   df <- data.frame(sex = c('MALE', 'FEMALE'))
+   substr(df$sex, 1, 1)
+
+In Python, you can use ``[]`` notation to extract a substring
+from a string by position locations. Keep in mind that Python
+indexes are zero-based.
+
+.. code-block:: none
+
+   df = pd.DataFrame({'sex': ['MALE', 'FEMALE']})
+   df['sex'].str[0:1]
+
+
+Upcase and Lowcase
+~~~~~~~~~~~~~~~~~~
+
+The R ``toupper`` and ``tolower`` functions change the case of the
+character string.
+
+.. code-block:: none
+
+   df <- data.frame(name = c('Johnny Bravo', 'Alex Mack'))
+   toupper(df$name)
+   tolower(df$name)
+
+The equivalent Python functions are ``upper`` and ``lower``.
+In addition, Python's ``title`` function changes the string to
+proper case.
+
+.. code-block:: none
+
+   df = pd.DataFrame({'name': ['Johnny Bravo', 'Alex Mack']})
+   df['name'].str.upper()
+   df['name'].str.lower()
+   df['name'].str.title()
+
+
 
 .. |c| replace:: ``c``
 .. _c: http://stat.ethz.ch/R-manual/R-patched/library/base/html/c.html